* [PATCH v2] hfsplus: add journal replay
From: Sergei Antonov @ 2014-02-24 10:05 UTC
  To: linux-fsdevel
  Cc: Sergei Antonov, Al Viro, Christoph Hellwig, Andrew Morton,
	Vyacheslav Dubeyko, Hin-Tak Leung

v1->v2 changes:
* Use WRITE_SYNC instead of WRITE to replay sectors.
* Return the more appropriate -ENOTSUPP instead of -EIO on unsupported journal flags.
* Updated the "Advantages over" section in the description.
* Fixed typos in comments and patch description.
* The journal_swab field has been removed from hfsplus_sb_info; a more local
  hfs_replay_data::need_swab is used instead. Suggested by Hin-Tak Leung.
* Removed a line of code that had no effect: "rd.header = rd.header;".

Add journal replay functionality. If the filesystem is mounted read-write and
a non-empty journal is found, a replay is performed. The original behavior of
switching to read-only mode in the presence of a journal is retained by
default, since the driver does not support journaled metadata modification
anyway. After the replay attempt, the mount procedure reloads the volume
header to get its updated state. Having performed journal replay, we do not
run the risk of stumbling on an inconsistency when working with the volume, so
this functionality adds stability while not changing any existing behavior.
The mount logic in super.c is changed as conservatively as possible.

The replay code checks the integrity of the whole journal before actually
replaying sectors. If any error (e.g. checksum mismatch, invalid/misaligned
offsets) is detected, the volume will remain unchanged and the journal will not
be cleared.

The code supports checksums for replay data in addition to checksums for the
header and lists. Data checksums are a relatively modern feature present on
volumes modified by recent OS X versions.

Tested with:
 drives with 512 and 4K sector sizes
 default (8 MB) and custom (1 MB, 500 MB) journal sizes
 big-endian (PowerPC) and little-endian (Intel) journal byte sexes

Advantages over a previously presented
   "[PATCH v4 00/15] hfsplus: introduce journal replay functionality"
   by a different author:
* 12 times faster mount with replay (0.5s versus 6s on my test volume)
* supports 4K sector devices
* no redundant stuff like journal initialization
* no fanatical use of macros with parameters
* misaligned fields are read through get_unaligned_be64()
* less intrusive changes
* shorter, easier to review

Signed-off-by: Sergei Antonov <saproj@gmail.com>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Christoph Hellwig <hch@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Vyacheslav Dubeyko <slava@dubeyko.com>
CC: Hin-Tak Leung <htl10@users.sourceforge.net>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
---
 fs/hfsplus/Makefile      |   3 +-
 fs/hfsplus/hfsplus_fs.h  |   8 +
 fs/hfsplus/hfsplus_raw.h |  65 +++-
 fs/hfsplus/journal.c     | 909 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/hfsplus/super.c       |  30 +-
 5 files changed, 1009 insertions(+), 6 deletions(-)
 create mode 100644 fs/hfsplus/journal.c

diff --git a/fs/hfsplus/Makefile b/fs/hfsplus/Makefile
index 683fca2..927e89c 100644
--- a/fs/hfsplus/Makefile
+++ b/fs/hfsplus/Makefile
@@ -6,6 +6,7 @@ obj-$(CONFIG_HFSPLUS_FS) += hfsplus.o
 
 hfsplus-objs := super.o options.o inode.o ioctl.o extents.o catalog.o dir.o btree.o \
 		bnode.o brec.o bfind.o tables.o unicode.o wrapper.o bitmap.o part_tbl.o \
-		attributes.o xattr.o xattr_user.o xattr_security.o xattr_trusted.o
+		attributes.o xattr.o xattr_user.o xattr_security.o xattr_trusted.o \
+		journal.o
 
 hfsplus-$(CONFIG_HFSPLUS_FS_POSIX_ACL)	+= posix_acl.o
diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h
index 08846425b..7a45d45 100644
--- a/fs/hfsplus/hfsplus_fs.h
+++ b/fs/hfsplus/hfsplus_fs.h
@@ -31,6 +31,7 @@
 #define DBG_BITMAP	0x00000040
 #define DBG_ATTR_MOD	0x00000080
 #define DBG_ACL_MOD	0x00000100
+#define DBG_JOURNAL	0x00000200
 
 #if 0
 #define DBG_MASK	(DBG_EXTENT|DBG_INODE|DBG_BNODE_MOD)
@@ -166,6 +167,10 @@ struct hfsplus_sb_info {
 	u32 total_blocks;
 	u32 data_clump_blocks, rsrc_clump_blocks;
 
+	/* immutable data on the placement of the journal (if there is one) */
+	u64 journal_offset;	/* relative to volume start */
+	u64 journal_size;	/* in bytes */
+
 	/* mutable data from the volume header, protected by alloc_mutex */
 	u32 free_blocks;
 	struct mutex alloc_mutex;
@@ -516,6 +521,9 @@ int hfs_part_find(struct super_block *, sector_t *, sector_t *);
 int hfsplus_submit_bio(struct super_block *sb, sector_t sector,
 		void *buf, void **data, int rw);
 
+/* journal.c */
+int hfsplus_jrnl_replay(struct super_block *, u32, bool *);
+
 /* time macros */
 #define __hfsp_mt2ut(t)		(be32_to_cpu(t) - 2082844800U)
 #define __hfsp_ut2mt(t)		(cpu_to_be32(t + 2082844800U))
diff --git a/fs/hfsplus/hfsplus_raw.h b/fs/hfsplus/hfsplus_raw.h
index 8ffb3a8..73ba1b6 100644
--- a/fs/hfsplus/hfsplus_raw.h
+++ b/fs/hfsplus/hfsplus_raw.h
@@ -105,7 +105,7 @@ struct hfsplus_vh {
 	__be16 version;
 	__be32 attributes;
 	__be32 last_mount_vers;
-	u32 reserved;
+	__be32 jrnl_info_block;
 
 	__be32 create_date;
 	__be32 modify_date;
@@ -145,6 +145,69 @@ struct hfsplus_vh {
 #define HFSPLUS_VOL_JOURNALED		(1 << 13)
 #define HFSPLUS_VOL_SOFTLOCK		(1 << 15)
 
+/* Journal info */
+struct hfs_jrnl_info {
+	__be32 flags;
+	u32 unused[8];
+	__be64 journal_offset;	/* Misaligned! relative to volume start */
+	__be64 journal_size;	/* Misaligned! size in bytes */
+} __packed;
+
+/* Journal info flags */
+#define HFSPLUS_JRNL_IN_FS		1
+#define HFSPLUS_JRNL_ON_OTHER_DEVICE	2
+#define HFSPLUS_JRNL_NEED_INIT		4
+
+/* Journal header */
+struct hfs_jrnl_header {
+	u32 magic;
+	u32 endian;
+	u64 data_begin;		/* relative to journal start */
+	u64 data_end;		/* relative to journal start */
+	u64 journal_size;
+	u32 list_size;		/* size of hfs_jrnl_list */
+	u32 checksum;
+	u32 hdr_size;		/* Header size. Tells us where journal buffer
+				 * begins. Also works as journal block size. */
+	u32 last_seq_num;		/* transaction sequence number */
+};
+
+#define HFSPLUS_JRNL_HDR_MAGIC	0x4a4e4c78
+#define HFSPLUS_JRNL_HDR_ENDIAN	0x12345678
+
+/* Journal data run. Describes sequential data stored in the journal. */
+struct hfs_jrnl_run {
+	u64 first_block;	/* Block number the run begins at. Note that it
+				 * is journal blocks, not filesystem blocks.
+				 * Value ~0 means the run should be skipped. */
+	u32 byte_len;		/* Data size in bytes. Multiple of
+				 * journal block size. */
+	u32 checksum;		/* Data checksum. */
+};
+
+/*
+ * Journal list. Describes one or more journaled data runs.
+ * It is normally a single transaction.
+ */
+struct hfs_jrnl_list {
+	u16 reserved;
+	u16 count;			/* number of runs plus 1 */
+	u32 length_with_data;		/* length of the list and data */
+	u32 checksum;			/* checksum of the first 32 bytes */
+	u32 flags;			/* see possible flags below */
+	u64 reserved1;			/* unused part of 1st fake run */
+	u32 reserved2;			/* unused part of 1st fake run */
+	u32 tr_end_or_seq_num;		/* Before sequence numbers introduction:
+					   zero value means end of transaction.
+					   After sequence numbers introduction:
+					   a non-zero sequence number. */
+	struct hfs_jrnl_run runs[0];	/* number of elements is count-1 */
+};
+
+/* Journal run list flags */
+#define HFSPLUS_JRNL_LIST_RUN_CHECKSUMS		1
+#define HFSPLUS_JRNL_LIST_TRANSACTION_STARTS	2
+
 /* HFS+ BTree node descriptor */
 struct hfs_bnode_desc {
 	__be32 next;
diff --git a/fs/hfsplus/journal.c b/fs/hfsplus/journal.c
new file mode 100644
index 0000000..3490297
--- /dev/null
+++ b/fs/hfsplus/journal.c
@@ -0,0 +1,909 @@
+/*
+ *  linux/fs/hfsplus/journal.c
+ *
+ * Handling of the HFS+ journal
+ */
+
+#include <linux/fs.h>
+#include <linux/bio.h>
+#include <linux/bug.h>
+#include <linux/log2.h>
+#include <linux/swab.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/blkdev.h>
+#include <linux/printk.h>
+#include <linux/kernel.h>
+#include <linux/stddef.h>
+#include <linux/string.h>
+#include <asm/unaligned.h>
+#include <linux/interval_tree_generic.h>
+
+#include "hfsplus_fs.h"
+#include "hfsplus_raw.h"
+
+/*
+ * An interval tree node.
+ * Holds data to replay to sequential sectors.
+ */
+struct hfs_replay_interval {
+	/* tree node data */
+	struct rb_node rb;
+	u64 subtree_last;
+
+	/* replay data */
+	u64 start, last;	/* 512 byte sectors relative to volume start */
+	void *data;		/* the bytes to write to the above sectors */
+
+	/*
+	 * Before the interval is read and checked it is marked so and
+	 * kept in a singly-linked list with head in struct hfs_replay_data.
+	 */
+	struct hfs_replay_interval *next_to_read;
+
+	/*
+	 * Works as read and check barrier. True means do all pending reads and
+	 * checks before overwriting the data with data from later transactions.
+	 */
+	bool in_read_list;
+
+	/*
+	 * Data needed to check a checksum. Valid while in the read list.
+	 * Normally checksum_ptr is equal to data, they will only be different
+	 * in a weird case of one run included in the other. HFS+ journal
+	 * format does not exclude that. See hfsplus_replay_data_add() for
+	 * a code processing intersecting runs.
+	 */
+	bool need_check_checksum;
+	u32 checksum;
+	void *checksum_ptr;
+};
+
+/* Used to read, verify and collect complete replay data from the journal */
+struct hfs_replay_data {
+	/* volume information */
+	const struct hfsplus_sb_info *sbi;
+	struct hfs_jrnl_header *header;
+	u64 volume_jrnl_blocks;
+	bool need_swab;			/* journal needs byte swapping */
+
+	/* IO stuff */
+	struct block_device *bdev;
+	struct bio *bio;		/* to keep pending read requests */
+	unsigned sectors_queued;	/* how many pending sectors to read */
+	unsigned max_vecs, max_sectors;	/* bio limits */
+
+	/* the root of an interval tree with hfs_jrnl_interval as nodes */
+	struct rb_root intervals_root;
+
+	/* head of the list of intervals to read and possibly check checksum */
+	struct hfs_replay_interval *first_to_read;
+
+	/*
+	 * The last transaction sequence number from journal list. Zero value
+	 * means we have a pre-sequence-number version of journal.
+	 */
+	u32 last_list_seq_num;
+
+	/* number of 512 byte sectors read and replayed from journal */
+	u64 sectors_read;
+	u64 sectors_replayed;
+};
+
+#define START(interval) ((interval)->start)
+#define LAST(interval) ((interval)->last)
+
+INTERVAL_TREE_DEFINE(struct hfs_replay_interval, rb, u64, subtree_last,
+		     START, LAST, static, hfsplus_interval);
+
+enum {
+	LIST_CHECKSUM_LENGTH = 32,
+	HEADER_CHECKSUM_LENGTH = 44
+};
+
+/* frees memory buffers, removes interval from the interval tree */
+static void hfsplus_jrnl_free_interval(struct rb_root *root,
+				       struct hfs_replay_interval *interval)
+{
+	kfree(interval->data);
+	hfsplus_interval_remove(interval, root);
+	kfree(interval);
+}
+
+/* to be called after we finish working with replay data */
+static void hfsplus_replay_data_free(struct hfs_replay_data *rd)
+{
+	struct hfs_replay_interval *interval;
+
+	/* free bio */
+	if (rd->bio)
+		bio_put(rd->bio);
+	rd->bio = NULL;
+
+	/* free all replay intervals */
+	while ((interval = hfsplus_interval_iter_first(
+			&rd->intervals_root, 0, U64_MAX))) {
+		hfsplus_jrnl_free_interval(&rd->intervals_root, interval);
+	}
+	rd->intervals_root = RB_ROOT;
+}
+
+static int hfsplus_jrnl_process_info(struct super_block *sb,
+				     const struct hfs_jrnl_info *info)
+{
+	struct hfsplus_sb_info *sbi = HFSPLUS_SB(sb);
+	const u64 volume_length = (u64)sbi->total_blocks <<
+		sbi->alloc_blksz_shift;
+
+	/*
+	 * Check the flags and proceed only if the journal is initialized
+	 * and resides on this volume.
+	 */
+	if (be32_to_cpu(info->flags) != HFSPLUS_JRNL_IN_FS) {
+		pr_err("unsupported journal info flags 0x%x\n",
+		       be32_to_cpu(info->flags));
+		return -ENOTSUPP;
+	}
+
+	/*
+	 * Extract journal placement data from unaligned fields
+	 * and store them in superblock info.
+	 */
+	sbi->journal_offset = get_unaligned_be64(&info->journal_offset);
+	sbi->journal_size = get_unaligned_be64(&info->journal_size);
+
+	/*
+	 * Validate journal placement. It is determined by
+	 * offset and size in bytes relative to the start of the volume.
+	 * These values are sane when sector-aligned and not outside the volume.
+	 */
+	if (!sbi->journal_size ||
+	    (sbi->journal_offset & (hfsplus_min_io_size(sb) - 1)) ||
+	    (sbi->journal_size & (hfsplus_min_io_size(sb) - 1)) ||
+	    sbi->journal_offset >= volume_length ||
+	    sbi->journal_size > volume_length - sbi->journal_offset) {
+		pr_err("bad journal placement: offset=0x%llx size=0x%llx\n",
+		       sbi->journal_offset, sbi->journal_size);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+/* check if byte offset is inside journal buffer */
+static inline bool is_inside_buffer(const struct hfs_jrnl_header *header,
+				    u64 offset /* relative to journal start */)
+{
+	/*
+	 * The journal buffer begins right after the header,
+	 * ends at the end of the journal.
+	 */
+	return offset >= header->hdr_size && offset <= header->journal_size;
+}
+
+static u32 hfsplus_jrnl_checksum(const void *data, unsigned len)
+{
+	unsigned i;
+	u32 checksum = 0;
+	const u8 *ptr = data;
+
+	for (i = 0; i < len; i++, ptr++)
+		checksum = (checksum << 8) ^ (checksum + *ptr);
+	return ~checksum;
+}
+
+static int hfsplus_jrnl_process_header(struct super_block *sb,
+				       struct hfs_jrnl_header *header,
+				       bool *need_swab)
+{
+	struct hfsplus_sb_info *sbi = HFSPLUS_SB(sb);
+	u32 header_checksum, calculated_checksum;
+
+	/*
+	 * Deal with checksums.
+	 * Save the original checksum value, replace it with zero, recalculate.
+	 */
+	header_checksum = header->checksum;
+	header->checksum = 0;
+	calculated_checksum = hfsplus_jrnl_checksum(header,
+						    HEADER_CHECKSUM_LENGTH);
+
+	/*
+	 * Check magic fields, determine if journal needs byte swapping,
+	 * and convert fields accordingly.
+	 */
+	if (HFSPLUS_JRNL_HDR_MAGIC == header->magic &&
+	    HFSPLUS_JRNL_HDR_ENDIAN == header->endian) {
+		/* The journal uses the same endianness as CPU. */
+		*need_swab = false;
+	} else if (HFSPLUS_JRNL_HDR_MAGIC == swab32(header->magic) &&
+		   HFSPLUS_JRNL_HDR_ENDIAN == swab32(header->endian)) {
+		/* The journal needs byte swapping. */
+		*need_swab = true;
+		header->data_begin = swab64(header->data_begin);
+		header->data_end = swab64(header->data_end);
+		header->journal_size = swab64(header->journal_size);
+		header->list_size = swab32(header->list_size);
+		header_checksum = swab32(header_checksum);
+		header->hdr_size = swab32(header->hdr_size);
+		header->last_seq_num = swab32(header->last_seq_num);
+	} else {
+		pr_err("unrecognized journal header\n");
+		return -EIO;
+	}
+
+	/* validate fields */
+	if (calculated_checksum != header_checksum) {
+		pr_err("journal header checksum mismatch (0x%x != 0x%x)\n",
+		       calculated_checksum, header_checksum);
+		return -EIO;
+	}
+	if (header->journal_size != sbi->journal_size) {
+		pr_err("different journal sizes in journal info and header\n");
+		return -EIO;
+	}
+	if (!header->hdr_size ||
+	    header->hdr_size % hfsplus_min_io_size(sb) ||
+	    !is_power_of_2(header->hdr_size) ||
+	    header->hdr_size >= sbi->journal_size) {
+		pr_err("invalid journal header size %d\n", header->hdr_size);
+		return -EIO;
+	}
+	if (!header->list_size || header->list_size % hfsplus_min_io_size(sb)) {
+		pr_err("journal list size %d is not a multiple of minimum io size %d\n",
+		       header->list_size, hfsplus_min_io_size(sb));
+		return -EIO;
+	}
+	if (!is_inside_buffer(header, header->data_begin) ||
+	    !is_inside_buffer(header, header->data_end)) {
+		pr_err("bad journal data offsets\n");
+		return -EIO;
+	}
+
+	hfs_dbg(JOURNAL,
+		"journal header: need swap bytes = %s, last transaction sequence number 0x%x\n",
+		*need_swab ? "true" : "false",
+		header->last_seq_num);
+	return 0;
+}
+
+static u64 hfsplus_jrnl_data_length(const struct hfs_jrnl_header *header)
+{
+	u64 buffer_size;
+
+	/*
+	 * Journal data is stored in a circular journal buffer.
+	 * If data does not wrap around the end the buffer, simply subtract.
+	 * Begin==end means zero data length, subtraction will result in it.
+	 */
+	if (header->data_begin <= header->data_end)
+		return header->data_end - header->data_begin;
+
+	/* data wraps around, so it is divided into two pieces */
+	buffer_size = header->journal_size - header->hdr_size;
+	return buffer_size - (header->data_begin - header->data_end);
+}
+
+/* executes all pending reads and checks checksums */
+static int hfsplus_jrnl_read_now(struct hfs_replay_data *rd)
+{
+	int error;
+	struct hfs_replay_interval *interval;
+
+	if (rd->bio && rd->bio->bi_vcnt) {
+		/* read synchronously */
+		hfs_dbg(JOURNAL, "submitting bio read nr_vecs %u nr_sectors %u\n",
+			rd->bio->bi_vcnt, rd->sectors_queued);
+		error = submit_bio_wait(READ, rd->bio);
+		if (error)
+			return error;
+
+		/*
+	 * We're done with the current bio.
+		 * Next call to hfsplus_jrnl_queue_read will create a new one.
+		 */
+		bio_put(rd->bio);
+		rd->bio = NULL;
+	}
+
+	/* process a list of read intervals */
+	for (interval = rd->first_to_read;
+	     interval;
+	     interval = interval->next_to_read) {
+		u32 calculated_checksum;
+
+		/* cast is ok - intervals are limited by a 32-bit run length */
+		const u32 byte_len =
+			(u32)(interval->last - interval->start + 1) << 9;
+
+		interval->in_read_list = false; /* read list will be emptied */
+
+		if (!interval->need_check_checksum)
+			continue;
+
+		/* calculate and compare checksum */
+		calculated_checksum = hfsplus_jrnl_checksum(
+			interval->checksum_ptr, byte_len);
+		if (interval->checksum != calculated_checksum) {
+			pr_err("invalid journal run 0x%llx - 0x%llx checksum (0x%x != 0x%x)\n",
+			       interval->start, interval->last,
+			       interval->checksum, calculated_checksum);
+			return -EIO;
+		}
+	}
+	rd->first_to_read = NULL;	/* empty read list */
+
+	return 0;
+}
+
+/*
+ * Adds an IO vector to read sequential bytes from the device.
+ * Does IO if bio is full.
+ *
+ * Isolates callers from bio limitations and IO vector handling.
+ */
+static int hfsplus_jrnl_queue_bio(struct hfs_replay_data *rd,
+				  char *buf, u32 len)
+{
+	int error = 0;
+	u32 len_sectors = len >> 9;
+
+	/* split a request that is too long */
+	while (len_sectors > rd->max_sectors) {
+		const u32 split_len = rd->max_sectors << 9;
+
+		error = hfsplus_jrnl_queue_bio(rd, buf, split_len);
+		if (error)
+			return error;
+		buf += split_len;
+		len -= split_len;
+		len_sectors = len >> 9;
+	}
+
+	/* do IO if adding this request would exceed the sector limit */
+	if (rd->max_sectors - rd->sectors_queued < len_sectors) {
+		error = hfsplus_jrnl_read_now(rd);
+		if (error)
+			return error;
+	}
+
+	/* create and initialize a bio structure if it does not exist */
+	if (!rd->bio) {
+		const sector_t sector = rd->sbi->part_start +
+			((rd->sbi->journal_offset + rd->header->data_begin) >>
+			 9);
+
+		rd->bio = bio_alloc(GFP_NOIO, rd->max_vecs);
+		if (!rd->bio)
+			return -ENOMEM;
+		rd->bio->bi_bdev = rd->bdev;
+		rd->bio->bi_vcnt = 0;
+		rd->bio->bi_iter.bi_size = 0;
+		rd->bio->bi_iter.bi_sector = sector;
+		rd->sectors_queued = 0;
+	}
+
+	/* add a new IO vector */
+	rd->bio->bi_io_vec[rd->bio->bi_vcnt].bv_page
+		= virt_to_page(buf);
+	rd->bio->bi_io_vec[rd->bio->bi_vcnt].bv_offset
+		= offset_in_page(buf);
+	rd->bio->bi_io_vec[rd->bio->bi_vcnt].bv_len = len;
+	rd->bio->bi_vcnt++;
+	rd->bio->bi_iter.bi_size += len;
+	rd->sectors_queued += len_sectors;
+
+	/* do IO if the bio is full by the iovec limit */
+	if (rd->bio->bi_vcnt == rd->bio->bi_max_vecs)
+		error = hfsplus_jrnl_read_now(rd);
+
+	return error;
+}
+
+/*
+ * Queue read request for journal data. Requests are placed at sequentially
+ * increasing offsets (header->data_begin is increased every time) so that
+ * the whole journal data from data_begin to data_end will be eventually read.
+ *
+ * Isolates callers from the buffer's circular nature and IO.
+ */
+static int hfsplus_jrnl_queue_read(struct hfs_replay_data *rd,
+				   char *buf, u32 len)
+{
+	int error = 0;
+	const u64 data_length = hfsplus_jrnl_data_length(rd->header);
+	const u64 len_to_buffer_end =
+		rd->sbi->journal_size - rd->header->data_begin;
+
+	/* make sure we can read the whole buffer */
+	if (data_length < len) {
+		pr_err("journal data ends unexpectedly: want to read 0x%x bytes, available 0x%llx bytes\n",
+		       len, data_length);
+		return -EIO;
+	}
+
+	if (len >= len_to_buffer_end) {
+		/* the requested data is split into two pieces */
+		error = hfsplus_jrnl_queue_bio(rd, buf, len_to_buffer_end);
+		buf += len_to_buffer_end;
+		len -= len_to_buffer_end;
+
+		/* finish current bio, we are jumping to another sector */
+		if (!error)
+			error = hfsplus_jrnl_read_now(rd);
+
+		/* wrap around the journal data pointer */
+		rd->header->data_begin = rd->header->hdr_size;
+	}
+
+	if (!error && len) {
+		error = hfsplus_jrnl_queue_bio(rd, buf, len);
+
+		/* advance the journal data pointer */
+		rd->header->data_begin += len;
+	}
+
+	return error;
+}
+
+/* reads data from the journal circular buffer at header->begin */
+static int hfsplus_jrnl_read_buf(struct hfs_replay_data *rd,
+				 void *buf, u32 len)
+{
+	/* queue a new read iovec */
+	int error = hfsplus_jrnl_queue_read(rd, buf, len);
+
+	/* execute all queued iovecs */
+	if (!error)
+		error = hfsplus_jrnl_read_now(rd);
+	return error;
+}
+
+/*
+ * Receives information on where the next byte_len bytes in journal data are
+ * supposed to be replayed to, and what their checksum is (if available).
+ * Finds the interval the data belongs to or creates a new interval.
+ * Then queues read and check operations.
+ * byte_len is a multiple of journal block size.
+ */
+static int hfsplus_replay_data_add(struct hfs_replay_data *rd,
+				   u64 jrnl_block, u32 byte_len,
+				   bool has_checksum, u32 checksum)
+{
+	/* turn journal blocks into 512 byte bio sectors */
+	const unsigned shift = ilog2(rd->header->hdr_size) - 9;
+	const u64 start = jrnl_block << shift;
+	const u64 last = start + (byte_len >> 9) - 1;
+
+	int error;
+	struct hfs_replay_interval *interval;
+	char *read_ptr;
+
+	/* check for intersections with existing intervals */
+	interval = hfsplus_interval_iter_first(&rd->intervals_root,
+					       start, last);
+	while (interval) {
+		/*
+		 * Intersection is found. We are going to overwrite its
+		 * memory, so check the barrier.
+		 */
+		if (interval->in_read_list) {
+			error = hfsplus_jrnl_read_now(rd);
+			if (error)
+				return error;
+		}
+
+		/* new data is completely included in the found interval */
+		if (interval->start <= start && interval->last >= last) {
+			hfs_dbg(JOURNAL,
+				"adding to an existing interval 0x%llx - 0x%llx\n",
+				interval->start, interval->last);
+			read_ptr = interval->data;
+			read_ptr += (start - interval->start) << 9;
+			goto queue_and_return;
+		}
+
+		/* new data completely overrides the found interval */
+		if (interval->start >= start && interval->last <= last) {
+			hfs_dbg(JOURNAL,
+				"removing an existing interval 0x%llx - 0x%llx\n",
+				interval->start, interval->last);
+			hfsplus_jrnl_free_interval(&rd->intervals_root,
+						   interval);
+		}
+
+		/* new data overrides the found interval at the end */
+		if (interval->start < start) {
+			hfs_dbg(JOURNAL,
+				"left intersection with 0x%llx - 0x%llx\n",
+				interval->start, interval->last);
+
+			/* shorten interval and reinsert */
+			hfsplus_interval_remove(interval, &rd->intervals_root);
+			interval->last = start - 1;
+			hfsplus_interval_insert(interval, &rd->intervals_root);
+		}
+
+		/* new data overrides the found interval at the beginning */
+		if (interval->last > last) {
+			const u32 intersection_len =
+				(last - interval->start + 1) << 9;
+			hfs_dbg(JOURNAL,
+				"right intersection with 0x%llx - 0x%llx\n",
+				interval->start, interval->last);
+
+			/* shorten interval, move data and reinsert */
+			hfsplus_interval_remove(interval, &rd->intervals_root);
+			memmove(interval->data,
+				interval->data + intersection_len,
+				(interval->last - last) << 9);
+			interval->start = last + 1;
+			hfsplus_interval_insert(interval, &rd->intervals_root);
+		}
+
+		/* start searching anew after we deleted or reinserted */
+		interval = hfsplus_interval_iter_first(&rd->intervals_root,
+						       start, last);
+	}
+
+	/* allocate a new interval and its data buffer */
+	interval = kmalloc(sizeof(*interval), GFP_KERNEL);
+	if (!interval)
+		return -ENOMEM;
+	interval->data = kmalloc(byte_len, GFP_KERNEL);
+	if (!interval->data) {
+		kfree(interval);
+		return -ENOMEM;
+	}
+
+	interval->start = start;
+	interval->last = last;
+	read_ptr = interval->data;
+	hfs_dbg(JOURNAL, "inserting a new interval 0x%llx - 0x%llx\n",
+		start, last);
+	hfsplus_interval_insert(interval, &rd->intervals_root);
+
+queue_and_return:
+	error = hfsplus_jrnl_queue_read(rd, read_ptr, byte_len);
+	if (error)
+		return error;
+	interval->in_read_list = true;
+	interval->need_check_checksum = has_checksum;
+	interval->checksum = checksum;
+	interval->checksum_ptr = read_ptr;
+	interval->next_to_read = rd->first_to_read;
+	rd->first_to_read = interval;
+	return 0;
+}
+
+/*
+ * Receives one journal run, converts it, validates it,
+ * queues read and check operations for replay data found in it.
+ */
+static int hfsplus_jrnl_run(struct hfs_replay_data *rd,
+			    struct hfs_jrnl_run *run,
+			    u32 *list_data_len, bool has_checksum)
+{
+	const u64 SKIP_RUN = ~0;
+
+	/* skip the run if it is marked by all 64 bits set */
+	if (run->first_block == SKIP_RUN)
+		return 0;
+
+	/* swap bytes if needed */
+	if (rd->need_swab) {
+		run->first_block = swab64(run->first_block);
+		run->byte_len = swab32(run->byte_len);
+		run->checksum = swab32(run->checksum);
+	}
+
+	/*
+	 * Validate length: it has to be a multiple of journal block size,
+	 * it cannot go beyond data length specified in the list.
+	 */
+	if (run->byte_len % rd->header->hdr_size) {
+		pr_err("journal run length 0x%x is not a multiple of journal block size 0x%x\n",
+		       run->byte_len, rd->header->hdr_size);
+		return -EIO;
+	}
+	if (!run->byte_len || run->byte_len > *list_data_len) {
+		pr_err("journal run length 0x%x is out of bounds\n",
+		       run->byte_len);
+		return -EIO;
+	}
+
+	/* it has to be inside volume bounds of course */
+	if (run->first_block >= rd->volume_jrnl_blocks ||
+	    rd->volume_jrnl_blocks - run->first_block <
+	    run->byte_len / rd->header->hdr_size) {
+		pr_err("invalid journal run block (0x%llx, %x)\n",
+		       run->first_block, run->byte_len);
+		return -EIO;
+	}
+
+	*list_data_len -= run->byte_len;
+	rd->sectors_read += run->byte_len >> 9;
+	return hfsplus_replay_data_add(rd, run->first_block, run->byte_len,
+				       has_checksum, run->checksum);
+}
+
+/* used during journal list validation */
+static inline unsigned max_list_count(const struct hfs_jrnl_header *header)
+{
+	const unsigned space_for_runs = header->list_size -
+		offsetof(struct hfs_jrnl_list, runs);
+
+	/* Remember to add 1 for one fake run. See structure definition. */
+	return space_for_runs / sizeof(struct hfs_jrnl_run) + 1;
+}
+
+/*
+ * Receives one journal list, converts it, validates it,
+ * queues read and check operations for replay data found in it.
+ */
+static int hfsplus_jrnl_list(struct hfs_replay_data *rd,
+			     struct hfs_jrnl_list *list)
+{
+	int error, i;
+	u32 list_checksum, calculated_checksum;
+	bool transaction_starts, run_checksums;
+	u32 list_data_len;
+
+	/*
+	 * Deal with checksums.
+	 * Save the original checksum value, replace it with zero, recalculate.
+	 */
+	list_checksum = list->checksum;
+	list->checksum = 0;
+	calculated_checksum = hfsplus_jrnl_checksum(list, LIST_CHECKSUM_LENGTH);
+
+	/* swap bytes if needed */
+	if (rd->need_swab) {
+		list->count = swab16(list->count);
+		list->length_with_data = swab32(list->length_with_data);
+		list_checksum = swab32(list_checksum);
+		list->flags = swab32(list->flags);
+		list->tr_end_or_seq_num = swab32(list->tr_end_or_seq_num);
+	}
+
+	/* validate fields */
+	if (calculated_checksum != list_checksum) {
+		pr_err("journal list checksum mismatch (0x%x != 0x%x)\n",
+		       calculated_checksum, list_checksum);
+		return -EIO;
+	}
+	if (!list->count || list->count > max_list_count(rd->header)) {
+		pr_err("journal list is invalid: count=0x%x\n", list->count);
+		return -EIO;
+	}
+	if (list->length_with_data < rd->header->list_size) {
+		pr_err("invalid journal list length (0x%x < 0x%x)\n",
+		       list->length_with_data, rd->header->list_size);
+		return -EIO;
+	}
+	list_data_len = list->length_with_data - rd->header->list_size;
+	if (list_data_len > hfsplus_jrnl_data_length(rd->header)) {
+		pr_err("journal list data length is longer than remaining data\n");
+		return -EIO;
+	}
+	if (list_data_len % rd->header->hdr_size) {
+		pr_err("journal list data length 0x%x is not a multiple of journal block size 0x%x\n",
+		       list_data_len, rd->header->hdr_size);
+		return -EIO;
+	}
+
+	rd->last_list_seq_num = list->tr_end_or_seq_num;
+	list->count--;
+	transaction_starts =
+		!!(list->flags & HFSPLUS_JRNL_LIST_TRANSACTION_STARTS);
+	run_checksums = !!(list->flags & HFSPLUS_JRNL_LIST_RUN_CHECKSUMS);
+
+	hfs_dbg(JOURNAL, "processing journal list with %u runs, 0x%x bytes of data, trstart = %d, run checksums = %d, seq_num=0x%x\n",
+		list->count, list_data_len,
+		transaction_starts, run_checksums,
+		list->tr_end_or_seq_num);
+
+	/* iterate over runs */
+	for (i = 0, error = 0; !error && i < list->count; i++) {
+		error = hfsplus_jrnl_run(rd, list->runs+i, &list_data_len,
+					 run_checksums);
+	}
+
+	/* if all went right we have zero unprocessed data left */
+	if (!error && list_data_len) {
+		pr_err("journal list data length exceeds a sum of run lengths in the list\n");
+		error = -EIO;
+	}
+
+	return error;
+}
+
+/* iterates over all journal lists, collects and checks replay data */
+static int hfsplus_jrnl_read_lists(struct hfs_replay_data *rd)
+{
+	int error;
+	void *buf;
+
+	buf = kmalloc(rd->header->list_size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	/* iterate over all journal lists from older transactions to newer */
+	do {
+		error = hfsplus_jrnl_read_buf(rd, buf, rd->header->list_size);
+		if (!error)
+			error = hfsplus_jrnl_list(rd, buf);
+	} while (!error && hfsplus_jrnl_data_length(rd->header));
+
+	/* do all pending reads and checks */
+	if (!error)
+		error = hfsplus_jrnl_read_now(rd);
+
+	hfs_dbg(JOURNAL,
+		"last seq num in journal data 0x%x, last seq num in header 0x%x\n",
+		rd->last_list_seq_num, rd->header->last_seq_num);
+
+	kfree(buf);
+	return error;
+}
+
+/* replay sectors read from journal */
+static int hfsplus_jrnl_write_intervals(struct hfs_replay_data *rd)
+{
+	int error = 0;
+	struct hfs_replay_interval *interval;
+
+	for (interval = hfsplus_interval_iter_first(&rd->intervals_root,
+						    0, U64_MAX);
+	     interval && !error;
+	     interval = hfsplus_interval_iter_next(interval, 0, U64_MAX)) {
+		struct bio *bio;
+
+		/* cast is ok - intervals are limited by a 32-bit run length */
+		const u32 len_sectors =
+			(u32)(interval->last - interval->start + 1);
+
+		hfs_dbg(JOURNAL, "replaying 0x%llx - 0x%llx\n",
+			interval->start, interval->last);
+
+		bio = bio_alloc(GFP_NOIO, 1);
+		if (!bio) {
+			error = -ENOMEM;
+			break;
+		}
+		bio->bi_bdev = rd->bdev;
+		bio->bi_vcnt = 1;
+		bio->bi_iter.bi_size = len_sectors << 9;
+		bio->bi_iter.bi_sector = rd->sbi->part_start + interval->start;
+		bio->bi_io_vec[0].bv_page = virt_to_page(interval->data);
+		bio->bi_io_vec[0].bv_offset = offset_in_page(interval->data);
+		bio->bi_io_vec[0].bv_len = len_sectors << 9;
+		error = submit_bio_wait(WRITE_SYNC, bio);
+		if (error)
+			pr_err("journal replay write error %d\n", error);
+		else
+			rd->sectors_replayed += len_sectors;
+		bio_put(bio);
+	}
+
+	return error;
+}
+
+/*
+ * After writing all sectors to the volume this function updates
+ * journal header. The header will be written in CPU endianness.
+ * This function assumes header->data_begin == header->data_end.
+ * Write mode is WRITE_FLUSH_FUA to flush previous writes first.
+ */
+static int hfsplus_jrnl_clear_header(struct super_block *sb,
+				     struct hfs_jrnl_header *header,
+				     sector_t journal_hdr_sector)
+{
+	int error;
+
+	header->magic = HFSPLUS_JRNL_HDR_MAGIC;
+	header->endian = HFSPLUS_JRNL_HDR_ENDIAN;
+	header->checksum = 0;
+	header->checksum =
+		hfsplus_jrnl_checksum(header, HEADER_CHECKSUM_LENGTH);
+	error = hfsplus_submit_bio(sb, journal_hdr_sector,
+				   header, 0, WRITE_FLUSH_FUA);
+	if (error)
+		pr_err("journal header write error %d\n", error);
+	return error;
+}
+
+/*
+ * This function is called during mount operation.
+ * Finds a journal, replays it if it is not empty.
+ */
+int hfsplus_jrnl_replay(struct super_block *sb, u32 jrnl_info_block,
+			bool *journal_was_empty)
+{
+	int error;
+	const struct hfsplus_sb_info *sbi = HFSPLUS_SB(sb);
+	const unsigned long block_shift =
+		sbi->alloc_blksz_shift - HFSPLUS_SECTOR_SHIFT;
+	struct hfs_jrnl_info *info;
+	void *buf;
+	sector_t journal_hdr_sector;
+	struct hfs_replay_data rd = {};
+
+	buf = kmalloc(hfsplus_min_io_size(sb), GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	/* step 1: journal info */
+	error = hfsplus_submit_bio(sb, sbi->part_start +
+				   (jrnl_info_block << block_shift),
+				   buf, (void **)&info, READ);
+	if (error) {
+		pr_err("journal info read error %d\n", error);
+		goto out;
+	}
+	error = hfsplus_jrnl_process_info(sb, info);
+	if (error)
+		goto out;
+	journal_hdr_sector = sbi->part_start +
+		(sbi->journal_offset >> HFSPLUS_SECTOR_SHIFT);
+
+	/* step 2: journal header */
+	error = hfsplus_submit_bio(sb, journal_hdr_sector,
+				   buf, (void **)&rd.header, READ);
+	if (error) {
+		pr_err("journal header read error %d\n", error);
+		goto out;
+	}
+	error = hfsplus_jrnl_process_header(sb, rd.header, &rd.need_swab);
+	if (error)
+		goto out;
+
+	*journal_was_empty = rd.header->data_begin == rd.header->data_end;
+	if (*journal_was_empty) {
+		hfs_dbg(JOURNAL, "journal is empty\n");
+		goto out;
+	}
+
+	/*
+	 * Initialize replay data structure.
+	 * It will change the header to advance data pointer as it reads data.
+	 * It needs to know the number of journal blocks on the volume.
+	 * It needs to know bio limitations.
+	 */
+	rd.sbi = sbi;
+	rd.volume_jrnl_blocks =
+		((u64)sbi->total_blocks << sbi->alloc_blksz_shift) >>
+		ilog2(rd.header->hdr_size);
+	rd.bdev = sb->s_bdev;
+	/* get max nr_vecs, limiting at 100 in case it is very generous */
+	rd.max_vecs = min_t(unsigned, 100, bio_get_nr_vecs(sb->s_bdev));
+	if (!rd.max_vecs) {
+		error = -EIO;
+		goto out;
+	}
+	rd.max_sectors = queue_max_sectors(bdev_get_queue(sb->s_bdev));
+	rd.intervals_root = RB_ROOT;
+
+	/* step 3: journal lists */
+	error = hfsplus_jrnl_read_lists(&rd);
+	if (error)
+		goto out;
+
+	/* step 4: write transactions to the volume */
+	error = hfsplus_jrnl_write_intervals(&rd);
+	if (error)
+		goto out;
+
+	/* step 5: flush written data and update journal header */
+	error = hfsplus_jrnl_clear_header(sb, rd.header, journal_hdr_sector);
+	if (error)
+		goto out;
+
+	hfs_dbg(JOURNAL, "%llu sectors have been read from journal, %llu sectors have been written to volume\n",
+		rd.sectors_read, rd.sectors_replayed);
+	pr_info("journal has been replayed\n");
+out:
+	hfsplus_replay_data_free(&rd);
+	kfree(buf);
+	return error;
+}
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 80875aa..84b832e 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -403,6 +403,7 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
 		goto out_unload_nls;
 	}
 
+reread_vhdr:
 	/* Grab the volume header */
 	if (hfsplus_read_wrapper(sb)) {
 		if (!silent)
@@ -447,6 +448,31 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_op = &hfsplus_sops;
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
 
+	if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
+	    !(sb->s_flags & MS_RDONLY)) {
+		bool journal_was_empty;
+		int replay_error;
+
+		pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
+		sb->s_flags |= MS_RDONLY;
+
+		replay_error = hfsplus_jrnl_replay(sb,
+			be32_to_cpu(vhdr->jrnl_info_block), &journal_was_empty);
+		if (replay_error)
+			pr_err("journal replay error %d\n", replay_error);
+
+		/*
+		 * Reread volume header after replay, no matter if it succeeded
+		 * or not. We've just set read-only flag, so no further replay
+		 * attempts will happen.
+		 */
+		kfree(sbi->s_vhdr_buf);
+		kfree(sbi->s_backup_vhdr_buf);
+		sbi->s_vhdr_buf = NULL;
+		sbi->s_backup_vhdr_buf = NULL;
+		goto reread_vhdr;
+	}
+
 	if (!(vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_UNMNT))) {
 		pr_warn("Filesystem was not cleanly unmounted, running fsck.hfsplus is recommended.  mounting read-only.\n");
 		sb->s_flags |= MS_RDONLY;
@@ -455,10 +481,6 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
 	} else if (vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_SOFTLOCK)) {
 		pr_warn("Filesystem is marked locked, mounting read-only.\n");
 		sb->s_flags |= MS_RDONLY;
-	} else if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
-			!(sb->s_flags & MS_RDONLY)) {
-		pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
-		sb->s_flags |= MS_RDONLY;
 	}
 
 	err = -EINVAL;
-- 
1.9.0



* Re: [PATCH v2] hfsplus: add journal replay
From: Vyacheslav Dubeyko @ 2014-02-24 10:21 UTC
  To: Sergei Antonov
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton, Hin-Tak Leung

On Mon, 2014-02-24 at 11:05 +0100, Sergei Antonov wrote:

I can only repeat that your patch is just cheating, to my mind, because this
work needs cooperation and respect for authorship.

So, I don't want to be CC'd on this patch.

With the best regards,
Vyacheslav Dubeyko.




* Re: [PATCH v2] hfsplus: add journal replay
From: Vyacheslav Dubeyko @ 2014-03-09 16:36 UTC
  To: Sergei Antonov
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton, Hin-Tak Leung

Hi Sergei,

On Mon, 2014-02-24 at 11:05 +0100, Sergei Antonov wrote:

We have really different visions of how to implement this functionality,
and I have many remarks about your implementation. So, I suggest we
discuss the principal points first. We need to reach a mutual
understanding, and then it makes sense to implement a joint solution.
Currently, both of our implementations have shortcomings, from my point
of view.
 
> The replay code checks the integrity of the whole journal before actually
> replaying sectors. If any error (e.g. checksum mismatch, invalid/misaligned
> offsets) is detected, the volume will remain unchanged and the journal will not
> be cleared.

I think that such an approach is overkill. Moreover, it is a wrong
approach.

First of all, the journal contains a sequence of transactions. Technical Note
TN1150: "A group of related changes is called a transaction. When all of
the changes of a transaction have been written to their normal locations
on disk, that transaction has been committed, and is removed from the
journal. The journal may contain several transactions. Copying changes
from all transactions to their normal locations on disk is called
replaying the journal". So, transactions can be committed one after
another, and that leaves the HFS+ volume in a consistent state after each
commit. It doesn't make sense to check all transactions up front and only
then commit them.

Secondly, suppose you had a sudden power-off. It means that the last
transaction will not be finished. In other words, it will simply be broken
or invalid. With your approach you will reject all transactions in the
journal. But the goal of the journal is to commit all valid transactions
before the invalid one. Otherwise, the file system remains in a corrupted
state despite journal replay.

And, finally, if you check all transactions first, then you need to read
all of them into memory before the real replay. And, as far as I can
judge, you simply keep all sectors in memory before replay. For example,
suppose your system has 512 MB of RAM and the HFS+ volume has a fully
filled journal 512 MB in size. How lucky will you be with journal replay
on such a system without swap?

So, I use another approach in my code: I check a transaction and then
replay it. And I think that transaction-based journal replay is the
reasonable and right way.
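
In sketch form, the loop shape I mean looks like this (the names here are
illustrative only, not code from either patchset):

/*
 * Transaction-based replay: validate one transaction, commit it, move on.
 * Memory use is bounded by the largest transaction, and every committed
 * transaction leaves the volume consistent.
 */
int error = 0;

while (journal_has_data(header)) {
        struct transaction *tr = read_next_transaction(header);

        if (IS_ERR(tr))
                break;          /* stop at the first invalid transaction */
        error = replay_transaction(tr);
        free_transaction(tr);
        if (error)
                break;
}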

> Tested with:
>  drives with 512 and 4K sector sizes

I've tested my code with 512 byte, 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB
and 64 KB block sizes. And I think that such testing is really necessary.
I've found and fixed several bugs, specifically for block sizes greater
than 4 KB. Your code is really complex, and I am afraid that 8-64 KB block
sizes will reveal some bugs.

>  default (8 MB) and custom (1 MB, 500 MB) journal sizes
>  big-endian (PowerPC) and little-endian (Intel) journal byte sexes

My last version of the patchset handles both little-endian and big-endian
journals. So, I can't see a difference between our patches here.

> Advantages over a previously presented
>    "[PATCH v4 00/15] hfsplus: introduce journal replay functionality"
>    by a different author:

My name is not "different author". My name is Vyacheslav Dubeyko. If you
wanted to insult me, then you have achieved the goal.

> * 12 times faster mount with replay (0.5s versus 6s on my test volume)

OK. Let's talk about it.

I had thought during implementation about more optimal bio submission,
but I simply chose the hfsplus_submit_bio() method, because journal replay
is not such a frequent operation: it takes place only during the mount
operation. Usually, the journal contains several transactions after a
clean umount (or it can simply be empty). It can contain many transactions
in the case of a sudden power-off, for example. But such a situation is
not so frequent, and it is more important for an end user to have reliable
journal replay. I suppose that every user is ready to wait a spare second
(or even a minute) to get a consistent volume after mount, because journal
replay exists to guarantee file system consistency, not performance.

OK, I also believe that performance is important. But you have suggested
a really big and complicated implementation that, moreover, is really hard
to understand at one glance. I can suggest a much, much simpler solution.
First of all, transaction-based journal replay is my principal position.
Secondly, it is possible to submit reads of all of a transaction's blocks
before starting to process the transaction. Such submission can be made by
means of an optimized version of hfsplus_submit_bio() or a similar analog.
Thirdly, it is moreover possible to calculate a block's checksum in the
bio->bi_end_io method for the case of block sizes less than 8 KB (512
bytes, 1 KB, 2 KB, 4 KB). That method is called by the block layer at the
end of block processing.
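
To illustrate the bi_end_io idea (a sketch only; the structure and
function names are hypothetical, and the checksum helper is assumed to be
the same algorithm both patches use):

/* Checksum a transaction block as soon as its read completes,
 * overlapping checksum work with the remaining IO. */
struct tr_block {
        void *buf;
        u32 len;
        u32 checksum;
        struct completion done;
};

static void tr_block_end_io(struct bio *bio, int error)
{
        struct tr_block *tb = bio->bi_private;

        if (!error)
                tb->checksum = hfsplus_jrnl_checksum(tb->buf, tb->len);
        complete(&tb->done);
        bio_put(bio);
}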

> * supports 4K sector devices

Could you point out where you make special efforts for this? And why do
you need to implement something in this direction?

I tried to find such an implementation in your code, but I failed. I
suspect that I am misunderstanding something.

> * no redundant stuff like journal initialization

My patchset is the first step in the implementation of full-featured
journaling support for the HFS+ driver. So, implementing journal
initialization functionality together with journal replay functionality is
a reasonable action, because these two features are related to each other,
from my viewpoint.

Of course, it is possible not to initialize the journal during mount at
the current step of the implementation. But implementing the journal
initialization functionality at the step of implementing journal replay
functionality is really necessary.

But I think that the journal needs to be initialized during mount of an
HFS+ volume (for the case of the kJIJournalNeedInitMask flag), because
that complies with Technical Note TN1150. And if we have bugs in this
functionality, then they need to be fixed. Anyway, if we want to have
full-featured journaling support in the HFS+ driver, then we need to
implement the journal initialization functionality.

> * no fanatical use of macros with parameters

I disagree that I use macros with fanaticism. I agree that it is really
bad taste to use macros in C++, but it is a common technique in the C
language, and it is widely used in the Linux kernel. So, using macros is
not my invention.

Moreover, the macros in my patchset are simple code patterns that can be
seen as simple abbreviations. For example:

#define JIB_BLOCK(sb) \
        (be32_to_cpu(HFSPLUS_SB(sb)->s_vhdr->journal_info_block))

I am unable to see anything bad in such a macro. It shortens the code and
makes it more readable at the point of use.
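
The equivalent static inline, for comparison, would be:

static inline u32 jib_block(struct super_block *sb)
{
        return be32_to_cpu(HFSPLUS_SB(sb)->s_vhdr->journal_info_block);
}

Both compile to the same code; the choice is purely a matter of taste.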

But why do you yourself fanatically use such macros as hfs_dbg(),
be32_to_cpu() and so on? Stop using macros in your patch if you are
appealing for macro abstinence.

> * shorter, easier to review

Really? :)

It is really hard for me to review your patch, because it is big,
complex, and not divided into logical parts.

Your patch contains 1015 lines of code in total.

My patch contains 1598 lines of code in total. If I exclude: (1) several
big comments - 99 lines, (2) journal initialization functionality - ~192
lines, (3) "norecovery" mount option functionality - ~134 lines, then I
will have about 1173 lines of code in my patchset.

So, it is 1015 lines of code vs. 1173 lines, which is not such a big
difference. And if you adopt my comments, ~100 lines (I believe that my
comments are really reasonable), and the "norecovery" mount option, ~100
lines of code, then the difference will vanish completely.

> @@ -166,6 +167,10 @@ struct hfsplus_sb_info {
>  	u32 total_blocks;
>  	u32 data_clump_blocks, rsrc_clump_blocks;
>  
> +	/* immutable data on the placement of the journal (if there is one) */
> +	u64 journal_offset;	/* relative to volume start */
> +	u64 journal_size;	/* in bytes */
> +

My patchset is the first step in the implementation of full-featured
journal support in the HFS+ driver. So, in my patchset I suggested storing
in the superblock info the following structure related to journal info:

struct hfsplus_journal {
        struct mutex jnl_lock;

        /* Journal info block specific */
        void *jib_buf;
        struct hfsplus_journal_info_block *jib;

        /* Journal header specific */
        void *jh_buf;
        struct hfsplus_journal_header *jh;
        u8 endianness;

        /* TODO: Pointer of JBD */

        struct super_block *sbp;
};

I suppose that such a structure is reasonable, and it contains all the
necessary starting details for journaling on an HFS+ volume.

Otherwise, if we are talking about journal replay only, then I don't see
any necessity in keeping any info about the journal. In that case, it
doesn't make sense to add anything related to journal details to the
superblock info structure. Journal replay is a self-sufficient operation.

>  
> +/* journal.c */
> +int hfsplus_jrnl_replay(struct super_block *, u32, bool *);

I really dislike "jrnl" abbreviation. Using "journal" instead of "jrnl"
is really reasonable and sensible way. I don't think that it makes sense
to economize on 3 symbols.

I tried to understand how you process the binfo array, but it is really
hard to understand the whole patch. So, I simply ask: one block list
(binfo array) can have a size from 4 KB to 16 KB, so a block list can be
placed in several file system blocks. I have implemented and tested
special functionality for this. How does your patch handle the situation
when the binfo array is placed in several blocks?

> +
>  /* time macros */
>  #define __hfsp_mt2ut(t)		(be32_to_cpu(t) - 2082844800U)
>  #define __hfsp_ut2mt(t)		(cpu_to_be32(t + 2082844800U))
> diff --git a/fs/hfsplus/hfsplus_raw.h b/fs/hfsplus/hfsplus_raw.h
> index 8ffb3a8..73ba1b6 100644
> --- a/fs/hfsplus/hfsplus_raw.h
> +++ b/fs/hfsplus/hfsplus_raw.h
> @@ -105,7 +105,7 @@ struct hfsplus_vh {
>  	__be16 version;
>  	__be32 attributes;
>  	__be32 last_mount_vers;
> -	u32 reserved;
> +	__be32 jrnl_info_block;
>  
>  	__be32 create_date;
>  	__be32 modify_date;
> @@ -145,6 +145,69 @@ struct hfsplus_vh {
>  #define HFSPLUS_VOL_JOURNALED		(1 << 13)
>  #define HFSPLUS_VOL_SOFTLOCK		(1 << 15)
>  
> +/* Journal info */
> +struct hfs_jrnl_info {
> +	__be32 flags;
> +	u32 unused[8];
> +	__be64 journal_offset;	/* Misaligned! relative to volume start */
> +	__be64 journal_size;	/* Misaligned! size in bytes */
> +} __packed;
> +
> +/* Journal info flags */
> +#define HFSPLUS_JRNL_IN_FS		1
> +#define HFSPLUS_JRNL_ON_OTHER_DEVICE	2
> +#define HFSPLUS_JRNL_NEED_INIT		4
> +
> +/* Journal header */
> +struct hfs_jrnl_header {
> +	u32 magic;
> +	u32 endian;
> +	u64 data_begin;		/* relative to journal start */
> +	u64 data_end;		/* relative to journal start */
> +	u64 journal_size;
> +	u32 list_size;		/* size of hfs_jrnl_list */
> +	u32 checksum;
> +	u32 hdr_size;		/* Header size. Tells us where journal buffer
> +				 * begins. Also works as journal block size. */
> +	u32 last_seq_num;		/* transaction sequence number */
> +};
> +
> +#define HFSPLUS_JRNL_HDR_MAGIC	0x4a4e4c78
> +#define HFSPLUS_JRNL_HDR_ENDIAN	0x12345678
> +
> +/* Journal data run. Describes sequential data stored in the journal. */
> +struct hfs_jrnl_run {
> +	u64 first_block;	/* Block number the run begins at. Note that it
> +				 * is journal blocks, not filesystem blocks.
> +				 * Value ~0 means the run should be skipped. */
> +	u32 byte_len;		/* Data size in bytes. Multiple of
> +				 * journal block size. */
> +	u32 checksum;		/* Data checksum. */
> +};
> +
> +/*
> + * Journal list. Describes one or more journaled data runs.
> + * It is normally a single transaction.
> + */
> +struct hfs_jrnl_list {
> +	u16 reserved;
> +	u16 count;			/* number of runs plus 1 */
> +	u32 length_with_data;		/* length of the list and data */
> +	u32 checksum;			/* checksum of the first 32 bytes */
> +	u32 flags;			/* see possible flags below */
> +	u64 reserved1;			/* unused part of 1st fake run */
> +	u32 reserved2;			/* unused part of 1st fake run */
> +	u32 tr_end_or_seq_num;		/* Before sequence numbers introduction:
> +					   zero value means end of transaction.
> +					   After sequence numbers introduction:
> +					   a non-zero sequence number. */
> +	struct hfs_jrnl_run runs[0];	/* number of elements is count-1 */
> +};
> +

I really dislike the idea of:
(1) naming structures in a really different manner than Technical Note
TN1150;
(2) naming fields of structures in a really different manner than
Technical Note TN1150;
(3) distorting a structure in comparison with Technical Note TN1150 (I
mean struct block_list_header, for example).

In such a way you simply confuse everybody, because your code becomes
hard to understand and obscure. From another point of view, you added the
last_seq_num field to the journal header, but you don't actually use this
field to compare its value with the value in the last transaction. So the
declarations in my patchset are much better.

Moreover, __le or __be types are usually used for declarations of on-disk
layout types. Yes, I understand that an HFS+ journal can be either
little-endian or big-endian. And I think that the Intel architecture is
the more frequent case now, so it makes sense to use __le types for the
on-disk layout HFS+ journal declarations (for the journal header, block
list header and block info).

I really dislike the idea of using u16, u32 and u64 types for on-disk
layout declarations. It is really confusing and provides no benefit.
There are four possible principal situations:
(1) LE system <-> LE journal;
(2) BE system <-> LE journal;
(3) LE system <-> BE journal;
(4) BE system <-> BE journal.
So, you need to use the xx_to_cpu() and cpu_to_xx() families of functions
anyway.
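
In sketch form (an illustrative helper, not from either patchset), every
field load would then go through something like:

/* The journal header's endian magic tells us whether fields are LE or
 * BE, so a raw on-disk u32 is never read directly. */
static inline u32 hfs_journal_get32(__le32 v, bool journal_is_be)
{
        return journal_is_be ? be32_to_cpu((__force __be32)v)
                             : le32_to_cpu(v);
}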

> diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
> index 80875aa..84b832e 100644
> --- a/fs/hfsplus/super.c
> +++ b/fs/hfsplus/super.c
> @@ -403,6 +403,7 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  		goto out_unload_nls;
>  	}
>  
> +reread_vhdr:
>  	/* Grab the volume header */
>  	if (hfsplus_read_wrapper(sb)) {

OK, I have a bug here in my patchset. But it can easily be fixed by
adopting your approach here.

>  		if (!silent)
> @@ -447,6 +448,31 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  	sb->s_op = &hfsplus_sops;
>  	sb->s_maxbytes = MAX_LFS_FILESIZE;
>  
> +	if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
> +	    !(sb->s_flags & MS_RDONLY)) {
> +		bool journal_was_empty;
> +		int replay_error;
> +
> +		pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
> +		sb->s_flags |= MS_RDONLY;
> +
> +		replay_error = hfsplus_jrnl_replay(sb,
> +			be32_to_cpu(vhdr->jrnl_info_block), &journal_was_empty);
> +		if (replay_error)
> +			pr_err("journal replay error %d\n", replay_error);
> +
> +		/*
> +		 * Reread volume header after replay, no matter if it succeeded
> +		 * or not. We've just set read-only flag, so no further replay
> +		 * attempts will happen.
> +		 */
> +		kfree(sbi->s_vhdr_buf);
> +		kfree(sbi->s_backup_vhdr_buf);
> +		sbi->s_vhdr_buf = NULL;
> +		sbi->s_backup_vhdr_buf = NULL;
> +		goto reread_vhdr;
> +	}
> +
>  	if (!(vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_UNMNT))) {
>  		pr_warn("Filesystem was not cleanly unmounted, running fsck.hfsplus is recommended.  mounting read-only.\n");
>  		sb->s_flags |= MS_RDONLY;
> @@ -455,10 +481,6 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  	} else if (vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_SOFTLOCK)) {
>  		pr_warn("Filesystem is marked locked, mounting read-only.\n");
>  		sb->s_flags |= MS_RDONLY;
> -	} else if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
> -			!(sb->s_flags & MS_RDONLY)) {
> -		pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
> -		sb->s_flags |= MS_RDONLY;

Simply removing these lines is wrong. The mount options need to be
reworked.

Adding journal replay implies the necessity of reworking the mount
options. The "force" mount option was introduced because of the absence of
journal replay functionality, so this reworking is really necessary. And I
reworked the mount options in my patchset by deleting the "force" option
and adding the "norecovery" mount option. Otherwise, leaving the mount
options untouched looks like schizophrenia.

[TO RESUME]

I have these principal points:
(1) On-disk layout declarations should live in peace with Technical Note
TN1150.
(2) Journal replay should be transaction-based; it needs to check and
commit transactions one by one.
(3) The "force" mount option should be changed to a "norecovery" mount
option.
(4) Journal initialization should be implemented.

As a result, I think that my patchset is a much better starting point for
a joint implementation with you.

I suggest:
(1) Fixing the bugs and shortcomings in my patchset by adopting some
solutions from your patch (rereading the superblock, using
get_unaligned_be64() and so on).
(2) Keeping journal replay transaction-based. It then makes sense to
optimize transaction processing by means of read-ahead of a transaction's
blocks before processing the transaction. Such submission can be made by
means of an optimized version of hfsplus_submit_bio() or a similar analog.
Moreover, it is possible to calculate a block's checksum in the
bio->bi_end_io method for the case of block sizes less than 8 KB (512
bytes, 1 KB, 2 KB, 4 KB).

Thanks,
Vyacheslav Dubeyko.




* Re: [PATCH v2] hfsplus: add journal replay
From: Sergei Antonov @ 2014-03-09 22:50 UTC (permalink / raw)
  To: Vyacheslav Dubeyko
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton, Hin-Tak Leung

On 9 March 2014 17:36, Vyacheslav Dubeyko <slava@dubeyko.com> wrote:
> Hi Sergei,
>
> On Mon, 2014-02-24 at 11:05 +0100, Sergei Antonov wrote:
>
> We have really different visions of how to implement this functionality,
> and I have many remarks about your implementation. So, I suggest we
> discuss the principal points first. We need to understand each other;
> then it makes sense to implement a joint solution. Currently, both of
> our implementations have shortcomings, from my point of view.
>
>> The replay code checks the integrity of the whole journal before actually
>> replaying sectors. If any error (e.g. checksum mismatch, invalid/misaligned
>> offsets) is detected, the volume will remain unchanged and the journal will not
>> be cleared.
>
> I think that such an approach is overkill. And, moreover, it is a wrong
> approach.
>
> First of all, the journal contains a sequence of transactions. Technical Note
> TN1150: "A group of related changes is called a transaction. When all of
> the changes of a transaction have been written to their normal locations
> on disk, that transaction has been committed, and is removed from the
> journal.

This is about journaling changes, not about replay.

> The journal may contain several transactions. Copying changes
> from all transactions to their normal locations on disk is called
> replaying the journal". So, transactions can be committed one after
> another and it leaves HFS+ volume in consistent state after each commit.

Of course! They can also be committed in groups, as long as you follow
the rule: first replay a transaction (or a bunch of them), then update
the journal header to remove them.

> It doesn't make sense to check all transactions preliminarily and only
> then to commit.

One paragraph earlier it was wrong, now it just doesn't make sense :).

With regard to speed it makes sense: it is less I/O, and more sequential
I/O (I read the journal sequentially without interruption, then replay
all collected sectors sorted by number).
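
Schematically, the write phase looks like this (a simplified sketch; the
struct and helper names here are illustrative, not the ones in my
patch):

#include <linux/sort.h>
#include <linux/types.h>

/* One collected run of journal payload destined for disk sectors. */
struct replay_run {
	sector_t first_sector;	/* where the data must be written */
	u32 byte_len;		/* length of the payload */
	void *data;		/* payload gathered from the journal */
};

static int replay_run_cmp(const void *a, const void *b)
{
	const struct replay_run *ra = a, *rb = b;

	if (ra->first_sector < rb->first_sector)
		return -1;
	return ra->first_sector > rb->first_sector;
}

/* Sort collected runs by target sector so writeback is sequential. */
static void sort_replay_runs(struct replay_run *runs, size_t nr)
{
	sort(runs, nr, sizeof(*runs), replay_run_cmp, NULL);
}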

> Secondly, for example, you had a sudden power-off. It means that the
> last transaction will not be finished.

You have a poor understanding of the HFS+ journal. After a sudden
power-off the last transaction is good. An unfinished (not completely
written) transaction can only be one beyond header->data_end. The header
is updated only after the transaction is completely on disk. This is why
the header is 1 sector long: the header update is atomic, and the
journal is consistent at any time.
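
The empty/non-empty test this relies on is trivial (field names as in
my struct hfs_jrnl_header; a minimal sketch):

/*
 * Per TN1150 the journal is empty when the begin and end offsets
 * coincide; only data between them has been committed to the journal.
 */
static bool hfsplus_jrnl_is_empty(const struct hfs_jrnl_header *h)
{
	return h->data_begin == h->data_end;
}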

If you've found corrupted transactions, it is not a sudden power-off;
rather it is your bug :) or a bug in the FS driver that wrote them. Wise
handling is: cry about it using printk, leave the journal unchanged,
mount read-only. This is what my code does.

> And, finally, if you first check all transactions, then it means that
> you need to read all of them into memory before actually replaying.
> And, as far as I can judge, you simply keep all sectors in memory
> before replay. For example, your system has 512 MB of RAM and the HFS+
> volume has a fully filled journal 512 MB in size. How lucky will you be
> with journal replay on such a system without swap?

Pretty lucky :) because:
1st, I store the data to be written to sectors, not the whole journal.
2nd, there is a feature of my code which significantly minimizes memory
usage: different data in the journal is often destined for the same
sectors, and I avoid allocating separate memory buffers for it. An
awesome <linux/interval_tree_generic.h> data structure helps me do it,
see hfsplus_replay_data_add() and the sketch below.

To defend my approach more, I can make measurements of memory
consumption for a 512 MB journal. Would you like me to?
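
The idea, reduced to a sketch (the names are illustrative, not my
patch's; note that recent kernels key such trees off struct
rb_root_cached rather than struct rb_root):

#include <linux/interval_tree_generic.h>
#include <linux/rbtree.h>
#include <linux/types.h>

/* Collected replay data for an inclusive range of target sectors. */
struct replay_extent {
	struct rb_node rb;
	sector_t start;			/* first target sector */
	sector_t last;			/* last target sector, inclusive */
	sector_t __subtree_last;	/* maintained by the interval tree */
	void *data;			/* buffer holding the payload */
};

#define EXTENT_START(e)	((e)->start)
#define EXTENT_LAST(e)	((e)->last)

INTERVAL_TREE_DEFINE(struct replay_extent, rb, sector_t, __subtree_last,
		     EXTENT_START, EXTENT_LAST, static, replay_extent)

/*
 * Before allocating a buffer for newly read journal data, look for an
 * already-collected extent overlapping [start, last] and reuse it, so
 * later journal writes to the same sectors overwrite in place.
 */
static struct replay_extent *replay_find(struct rb_root *root,
					 sector_t start, sector_t last)
{
	return replay_extent_iter_first(root, start, last);
}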

> So, I use another approach in my code: I check a transaction and then
> replay it. And I think that transaction-based journal replay is the
> reasonable and right way.
>
>> Tested with:
>>  drives with 512 and 4K sector sizes
>
> I've tested my code with 512 byte, 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32
> KB and 64 KB block sizes. And I think that such testing is really
> necessary. I've found and fixed several bugs specifically for block
> sizes greater than 4 KB.

But I tested with a 4K-sector iPod Nano. My code works right; your code
caused file system errors after replay. I sent the fsck_hfs output and
relevant lines from the system log in your thread titled "[PATCH v4
00/15] hfsplus: introduce journal replay functionality".

> Your code is really complex and I am afraid that 8 -
> 64 KB block sizes will reveal some bugs.

Well, check it. Do not just spread FUD.
I checked your code and made a bug report.
My journal.c has no known bugs so far. Find one.

>>  default (8 MB) and custom (1 MB, 500 MB) journal sizes
>>  big-endian (PowerPC) and little-endian (Intel) journal byte sexes
>
> The latest version of my patchset handles both little-endian and
> big-endian journals. So, I can't see a difference between our patches here.
>
>> Advantages over a previously presented
>>    "[PATCH v4 00/15] hfsplus: introduce journal replay functionality"
>>    by a different author:
>
> My name is not "different author". My name is Vyacheslav Dubeyko. If you
> wanted to insult me then you have achieved the goal.

No flame, please :).

>> * 12 times faster mount with replay (0.5s versus 6s on my test volume)
>
> OK. Let's talk about it.
>
> I thought during implementation about more optimal bio submission, but
> I simply chose the hfsplus_submit_bio() method, because journal replay
> is not a frequent operation: it takes place only during mount. Usually
> the journal contains several transactions after a clean umount (or it
> can simply be empty). It can contain many transactions in the case of a
> sudden power-off, for example, but such a situation is not frequent,
> and it is more important for an end user to have reliable journal
> replay. I suppose that every user is ready to wait a spare second (or
> even a minute) to achieve a consistent state of a volume after mount,
> because the goal of journal replay is to guarantee file system
> consistency, not performance.
>
> OK. I also believe that performance is important. But you have
> suggested a really big and complicated implementation

Mine is 5 files changed, 1009 insertions(+), 6 deletions(-).
Yours is 9 files changed, 1559 insertions(+), 39 deletions(-).

> that, moreover, is really
> hard to understand at a glance. I can suggest a much, much simpler
> solution. First of all, transaction-based journal replay is my
> principal position. Secondly, it is possible to submit reads of all of
> a transaction's blocks before starting to process the transaction. Such
> submission can be made with an optimized version of
> hfsplus_submit_bio() or a similar helper. Thirdly, it is possible to
> calculate a block's checksum in the bio->bi_end_io method for block
> sizes smaller than 8 KB (512 bytes, 1 KB, 2 KB, 4 KB). That method is
> called by the block layer at the end of block processing.
>
>> * supports 4K sector devices
>
> Could you point out where you make special efforts for it? And why do
> you need to implement anything in this direction?

There is no special place in code. The code is sector-size-agnostic
and I tested with a 4K-sector device.

> I tried to find such an implementation in your code but failed. I
> suspect that I misunderstand something.
>
>> * no redundant stuff like journal initialization
>
> My patchset is the first step in implementing full-featured journaling
> support for the HFS+ driver. So, implementing journal initialization
> together with journal replay is a reasonable action, because these two
> features are related to each other, from my viewpoint.
>
> Of course, it is possible not to initialize the journal during mount at
> the current step of implementation. But implementing journal
> initialization at the same step as journal replay is really necessary.

It is not necessary for journal replay. Quite the opposite - it has
nothing to do with it.

> But I think that the journal needs to be initialized during mount of an
> HFS+ volume (in the case of the kJIJournalNeedInitMask flag), because
> that agrees with Technical Note TN1150.

I answered this argument of yours before. Copy-pasting:
=======
Which does not mean initialization is necessary. It is, of course,
necessary for drivers that write to journal because they need the
journal to write transactions to. But in this driver you initialize it
and ... what? Nothing. It will not be used.
=======

> And if we have bugs in this
> functionality, then they need to be fixed. Anyway, if we want to have
> full-featured journaling support in the HFS+ driver, then we need
> to implement journal initialization functionality.
>
>> * no fanatical use of macros with parameters
>
> I disagree that I use macros with fanaticism. I agree that it is really
> bad taste to use macros in C++. But it is a common technique in the C
> language, and this technique is widely used in the Linux kernel. So,
> using macros is not my invention.
>
> Moreover, the macros in my patchset are simple code patterns that can
> be seen as simple abbreviations. For example:
>
> #define JIB_BLOCK(sb) \
>         (be32_to_cpu(HFSPLUS_SB(sb)->s_vhdr->journal_info_block))
>
> I am unable to see anything bad in such a macro. It shortens the code
> and makes it more sensible at the place of use.
>
> But why do you so fanatically use such macros as hfs_dbg(),
> be32_to_cpu() and so on? Stop using macros in your patch if you are
> appealing for macro abstinence.

I do not introduce macros with params.
An inline function would do the job instead of the JIB_BLOCK you mention.

>> * shorter, easier to review
>
> Really? :)
>
> It is really hard for me to review your patch, because it is big,
> complex, and not divided into logical parts.

My patch is not easily divisible because, of its 1015 lines, 909 are the
new file journal.c and 65 lines are new on-disk structures in the
header. The other 41 lines are minimal to the point of being atomic :).
On the size I replied above.

> Your patch contains 1015 lines of code in total.
>
> My patch contains 1598 lines of code in total. If I exclude: (1) several
> big comments - 99 lines, (2) journal initialization functionality - ~192
> lines, (3) "norecovery" mount option functionality - ~134 lines, then I
> will have about 1173 lines of code in my patchset.
>
> So, 1015 lines of code vs. 1173 lines. That is not such a big
> difference. And if you adopt my comments, ~100 lines (I believe that my
> comments are really reasonable), and the "norecovery" mount option,
> ~100 lines of code, then the difference will vanish completely.
>
>> @@ -166,6 +167,10 @@ struct hfsplus_sb_info {
>>       u32 total_blocks;
>>       u32 data_clump_blocks, rsrc_clump_blocks;
>>
>> +     /* immutable data on the placement of the journal (if there is one) */
>> +     u64 journal_offset;     /* relative to volume start */
>> +     u64 journal_size;       /* journal size in bytes */
>> +
>
> My patchset is the first step in implementing full-featured journal
> support in the HFS+ driver. So, in my patchset I suggested storing in
> the superblock info the following structure related to journal info:
>
> struct hfsplus_journal {
>         struct mutex jnl_lock;
>
>         /* Journal info block specific */
>         void *jib_buf;
>         struct hfsplus_journal_info_block *jib;
>
>         /* Journal header specific */
>         void *jh_buf;
>         struct hfsplus_journal_header *jh;
>         u8 endianness;
>
>         /* TODO: Pointer of JBD */
>
>         struct super_block *sbp;
> };
>
> I suppose that such a structure is reasonable and contains all the
> necessary starting details for journaling on an HFS+ volume.
>
> Otherwise, if we are talking about journal replay only, then I don't
> see any necessity to keep any info about the journal. In that case, it
> doesn't make sense to add anything related to journal details to the
> superblock info structure. Journal replay is a self-sufficient
> operation.

It is an unimportant issue IMO. I can even get rid of all changes to
struct hfsplus_sb_info in v3 of my patch. A big special structure may be
introduced later, when and if journaling modification code is developed.

>> +/* journal.c */
>> +int hfsplus_jrnl_replay(struct super_block *, u32, bool *);
>
> I really dislike the "jrnl" abbreviation. Using "journal" instead of
> "jrnl" is the reasonable and sensible way. I don't think that it makes
> sense to economize on 3 characters.

Oh, that isn't worth much discussion. You'd like a longer function prefix? OK.

> I tried to understand how you process the binfo array, but it is really
> hard to understand the whole patch, so I simply ask: one block list
> (binfo array) can have a size from 4 KB to 16 KB, so a block list can
> be placed in several file system blocks. I have implemented and tested
> special functionality for this. How is your patch prepared for the
> situation when the binfo array is placed in several blocks?

If they are sequential blocks on disk, they are read through a bio; a
bio can read several blocks at once.
If the structure is split into two parts (i.e. it wraps around the end
of the journal buffer), it is read in two parts.
See hfsplus_jrnl_queue_read() for details. The execution gets there
from hfsplus_jrnl_read_lists() -> hfsplus_jrnl_read_buf().
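
Roughly (a sketch with invented helper names, not the actual ones):

/* Hypothetical low-level helpers assumed to exist for this sketch. */
int jrnl_read(struct super_block *sb, void *buf, u64 vol_off, u64 len);
u64 jrnl_hdr_size(struct super_block *sb);

/*
 * Read "len" bytes at "offset" within the circular journal buffer,
 * splitting the read in two when it wraps past the end of the journal.
 * The buffer restarts right after the header (hdr_size bytes in).
 */
static int jrnl_read_wrapped(struct super_block *sb, void *buf,
			     u64 jrnl_off, u64 jrnl_size,
			     u64 offset, u64 len)
{
	if (offset + len > jrnl_size) {
		u64 first = jrnl_size - offset;
		int err;

		err = jrnl_read(sb, buf, jrnl_off + offset, first);
		if (err)
			return err;
		return jrnl_read(sb, buf + first,
				 jrnl_off + jrnl_hdr_size(sb), len - first);
	}
	return jrnl_read(sb, buf, jrnl_off + offset, len);
}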

>> +
>>  /* time macros */
>>  #define __hfsp_mt2ut(t)              (be32_to_cpu(t) - 2082844800U)
>>  #define __hfsp_ut2mt(t)              (cpu_to_be32(t + 2082844800U))
>> diff --git a/fs/hfsplus/hfsplus_raw.h b/fs/hfsplus/hfsplus_raw.h
>> index 8ffb3a8..73ba1b6 100644
>> --- a/fs/hfsplus/hfsplus_raw.h
>> +++ b/fs/hfsplus/hfsplus_raw.h
>> @@ -105,7 +105,7 @@ struct hfsplus_vh {
>>       __be16 version;
>>       __be32 attributes;
>>       __be32 last_mount_vers;
>> -     u32 reserved;
>> +     __be32 jrnl_info_block;
>>
>>       __be32 create_date;
>>       __be32 modify_date;
>> @@ -145,6 +145,69 @@ struct hfsplus_vh {
>>  #define HFSPLUS_VOL_JOURNALED                (1 << 13)
>>  #define HFSPLUS_VOL_SOFTLOCK         (1 << 15)
>>
>> +/* Journal info */
>> +struct hfs_jrnl_info {
>> +     __be32 flags;
>> +     u32 unused[8];
>> +     __be64 journal_offset;  /* Misaligned! relative to volume start */
>> +     __be64 journal_size;    /* Misaligned! journal size in bytes */
>> +} __packed;
>> +
>> +/* Journal info flags */
>> +#define HFSPLUS_JRNL_IN_FS           1
>> +#define HFSPLUS_JRNL_ON_OTHER_DEVICE 2
>> +#define HFSPLUS_JRNL_NEED_INIT               4
>> +
>> +/* Journal header */
>> +struct hfs_jrnl_header {
>> +     u32 magic;
>> +     u32 endian;
>> +     u64 data_begin;         /* relative to journal start */
>> +     u64 data_end;           /* relative to journal start */
>> +     u64 journal_size;
>> +     u32 list_size;          /* size of hfs_jrnl_list */
>> +     u32 checksum;
>> +     u32 hdr_size;           /* Header size. Tells us where journal buffer
>> +                              * begins. Also works as journal block size. */
>> +     u32 last_seq_num;               /* transaction sequence number */
>> +};
>> +
>> +#define HFSPLUS_JRNL_HDR_MAGIC       0x4a4e4c78
>> +#define HFSPLUS_JRNL_HDR_ENDIAN      0x12345678
>> +
>> +/* Journal data run. Describes sequential data stored in the journal. */
>> +struct hfs_jrnl_run {
>> +     u64 first_block;        /* Block number the run begins at. Note that it
>> +                              * is journal blocks, not filesystem blocks.
>> +                              * Value ~0 means the run should be skipped. */
>> +     u32 byte_len;           /* Data size in bytes. Multiple of
>> +                              * journal block size. */
>> +     u32 checksum;           /* Data checksum. */
>> +};
>> +
>> +/*
>> + * Journal list. Describes one or more journaled data runs.
>> + * It is normally a single transaction.
>> + */
>> +struct hfs_jrnl_list {
>> +     u16 reserved;
>> +     u16 count;                      /* number of runs plus 1 */
>> +     u32 length_with_data;           /* length of the list and data */
>> +     u32 checksum;                   /* checksum of the first 32 bytes */
>> +     u32 flags;                      /* see possible flags below */
>> +     u64 reserved1;                  /* unused part of 1st fake run */
>> +     u32 reserved2;                  /* unused part of 1st fake run */
>> +     u32 tr_end_or_seq_num;          /* Before sequence numbers introduction:
>> +                                        zero value means end of transaction.
>> +                                        After sequence numbers introduction:
>> +                                        a non-zero sequence number. */
>> +     struct hfs_jrnl_run runs[0];    /* number of elements is count-1 */
>> +};
>> +
>
> I really dislike the idea:
> (1) of naming structures in a really different manner than in Technical
> Note TN1150;

See explanation for (3) below.

> (2) of naming fields of structures in a really different manner than in
> Technical Note TN1150;

You contradict yourself sometimes :). I remember when you yourself
suggested changing Apple's name of a structure field. It was when we
discussed another patch:
========
What about shorter name as "subfolders" for both fields in struct
hfsplus_inode_info and in struct hfsplus_cat_folder? I suppose it will
be more informative and correct.
========
Apple's name was "folderCount", my name was after it: "folder_count",
you suggested a different, but indeed better, "subfolders". In my
patch I also found several better (IMO) names.

> (3) of distorting a structure compared with Technical Note TN1150 (I
> mean struct block_list_header, for example).

I'll explain.
block_list_header is not good because it describes sequential block
runs, not blocks. To avoid a double meaning of "block" I came up with:
  struct block_list_header (not really a block is meant) -> struct hfs_jrnl_list
  struct block_info (not really a block is meant) -> struct hfs_jrnl_run
Also changed fields:
  block_info::bnum (now comes the real block!) -> hfs_jrnl_run::first_block (the 1st block of the run)
  block_info::bsize (size in blocks?) -> hfs_jrnl_run::byte_len (no, it is in bytes! make it clear)

> In this way you simply confuse everybody, because your code becomes
> hardly understandable and obscure. From another point of view, you
> added the last_seq_num field to the journal header, but actually you
> don't use this field to compare its value with the value in the last
> transaction.

It is used for tracing:
+       hfs_dbg(JOURNAL,
+               "last seq num in journal data 0x%x, last seq num in header 0x%x\n",
+               rd->last_list_seq_num, rd->header->last_seq_num);

What do you use that comparison for?
I guess comparison is only needed for replaying extra transactions. Do
you know what extra transactions in the HFS+ journal are? They are not
in the spec, they are in the code. Search for "extra transactions" in
Apple's vfs_journal.c. Weird feature, I've never seen it on a volume.

> So,
> the declarations in my patchset are much better.
>
> Moreover, __le or __be types are usually used for declarations of
> on-disk layout types. Yes, I understand that an HFS+ journal can be
> either little-endian or big-endian. And I think that the Intel
> architecture is the more frequent case now. So, it makes sense to use
> __le types for the on-disk HFS+ journal declarations (in the case of
> the journal header, block list header and block info).
>
> I really dislike the idea of using u16, u32 and u64 types for on-disk
> layout declarations. It's really confusing and provides no benefit.
> There are four possible principal situations:
> (1) LE system <-> LE journal;
> (2) BE system <-> LE journal;
> (3) LE system <-> BE journal;
> (4) BE system <-> BE journal.
> So, you need to use the xx_to_cpu() and cpu_to_xx() families of
> helpers anyway.

I need to? No. For cases 2 and 3 I use swab32() and swap the fields in
place; this is why u32/u16 are reasonable. In cases 1 and 4 the values
are already in cpu order, and u32/u16 are absolutely reasonable.
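
A minimal sketch of the in-place swap, over the struct hfs_jrnl_header
from the patch (the helper names are illustrative):

#include <linux/swab.h>

/* A foreign-endian journal stores the endian magic byte-swapped. */
static bool hfsplus_jrnl_needs_swab(const struct hfs_jrnl_header *h)
{
	return h->endian == swab32(HFSPLUS_JRNL_HDR_ENDIAN);
}

/* After this, every field is in CPU order whatever the host is. */
static void hfsplus_jrnl_header_swab(struct hfs_jrnl_header *h)
{
	h->magic        = swab32(h->magic);
	h->endian       = swab32(h->endian);
	h->data_begin   = swab64(h->data_begin);
	h->data_end     = swab64(h->data_end);
	h->journal_size = swab64(h->journal_size);
	h->list_size    = swab32(h->list_size);
	h->checksum     = swab32(h->checksum);
	h->hdr_size     = swab32(h->hdr_size);
	h->last_seq_num = swab32(h->last_seq_num);
}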

>> diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
>> index 80875aa..84b832e 100644
>> --- a/fs/hfsplus/super.c
>> +++ b/fs/hfsplus/super.c
>> @@ -403,6 +403,7 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>>               goto out_unload_nls;
>>       }
>>
>> +reread_vhdr:
>>       /* Grab the volume header */
>>       if (hfsplus_read_wrapper(sb)) {
>
> OK, I have a bug here in my patchset. But it can be easily fixed by
> adopting your approach here.
>
>>               if (!silent)
>> @@ -447,6 +448,31 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>>       sb->s_op = &hfsplus_sops;
>>       sb->s_maxbytes = MAX_LFS_FILESIZE;
>>
>> +     if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
>> +         !(sb->s_flags & MS_RDONLY)) {
>> +             bool journal_was_empty;
>> +             int replay_error;
>> +
>> +             pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
>> +             sb->s_flags |= MS_RDONLY;
>> +
>> +             replay_error = hfsplus_jrnl_replay(sb,
>> +                     be32_to_cpu(vhdr->jrnl_info_block), &journal_was_empty);
>> +             if (replay_error)
>> +                     pr_err("journal replay error %d\n", replay_error);
>> +
>> +             /*
>> +              * Reread volume header after replay, no matter if it succeeded
>> +              * or not. We've just set read-only flag, so no further replay
>> +              * attempts will happen.
>> +              */
>> +             kfree(sbi->s_vhdr_buf);
>> +             kfree(sbi->s_backup_vhdr_buf);
>> +             sbi->s_vhdr_buf = NULL;
>> +             sbi->s_backup_vhdr_buf = NULL;
>> +             goto reread_vhdr;
>> +     }
>> +
>>       if (!(vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_UNMNT))) {
>>               pr_warn("Filesystem was not cleanly unmounted, running fsck.hfsplus is recommended.  mounting read-only.\n");
>>               sb->s_flags |= MS_RDONLY;
>> @@ -455,10 +481,6 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>>       } else if (vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_SOFTLOCK)) {
>>               pr_warn("Filesystem is marked locked, mounting read-only.\n");
>>               sb->s_flags |= MS_RDONLY;
>> -     } else if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
>> -                     !(sb->s_flags & MS_RDONLY)) {
>> -             pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
>> -             sb->s_flags |= MS_RDONLY;
>
> Simply removing these lines is wrong. The mount options need to be
> reworked.
>
> Adding journal replay makes reworking the mount options necessary. The
> "force" mount option was introduced because of the absence of journal
> replay functionality, so this rework is really needed. In my patchset I
> reworked the mount options by deleting "force" and adding a
> "norecovery" option. Leaving the mount options untouched would be
> inconsistent.

I'd like us to make a common patch, and I'd totally trust you to do the
changes to mount options. You are good at it.

> [TO SUMMARIZE]
>
> My principal points are:
> (1) On-disk layout declarations should agree with Technical Note
> TN1150;

Well, I gave my rationale. The discussion can be continued. I am not
against changes to my cool names.

> (2) Journal replay should be transaction based: each transaction should
> be checked and committed one by one.

I'd agree to committing at some threshold: if the collected data exceeds
1 megabyte, write the transactions out and free some memory.
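
Roughly (a sketch only; the context struct and helpers are invented
names for illustration):

#include <linux/errno.h>
#include <linux/types.h>

struct jrnl_ctx;	/* hypothetical replay context */

int read_next_transaction(struct jrnl_ctx *ctx);   /* 1 = got one, 0 = done, <0 = error */
size_t collect_transaction(struct jrnl_ctx *ctx);  /* bytes added to the replay set */
int write_collected_sectors(struct jrnl_ctx *ctx); /* replay what was collected */
int advance_journal_header(struct jrnl_ctx *ctx);  /* then drop it from the journal */

#define REPLAY_FLUSH_THRESHOLD	(1024 * 1024)	/* 1 MB, as suggested above */

static int replay_with_threshold(struct jrnl_ctx *ctx)
{
	size_t collected = 0;
	int got;

	while ((got = read_next_transaction(ctx)) > 0) {
		collected += collect_transaction(ctx);
		if (collected < REPLAY_FLUSH_THRESHOLD)
			continue;
		/* Replay what we have, then remove it from the journal. */
		if (write_collected_sectors(ctx) || advance_journal_header(ctx))
			return -EIO;
		collected = 0;
	}
	if (got < 0)
		return got;
	/* Final flush of the tail below the threshold. */
	if (collected &&
	    (write_collected_sectors(ctx) || advance_journal_header(ctx)))
		return -EIO;
	return 0;
}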

> (3) The "force" mount option should be replaced with a "norecovery"
> mount option.

As you wish.

> (4) Journal initialization should be implemented.

Unneeded, and it makes the patch heavier.

> As a result, I think that my patchset is a much better starting point
> for a joint implementation with you.
>
> I suggest:
> (1) Fixing the bugs and shortcomings in my patchset by adopting some
> solutions from your patch (rereading the superblock, using
> get_unaligned_be64() and so on);
> (2) Keeping journal replay transaction based. It makes sense to
> optimize transaction processing by reading a transaction's blocks ahead
> of processing that transaction. Such submission can be made with an
> optimized version of hfsplus_submit_bio() or a similar helper.
> Moreover, it is possible to calculate a block's checksum in the
> bio->bi_end_io method for block sizes smaller than 8 KB (512 bytes,
> 1 KB, 2 KB, 4 KB).

It is possible to calculate checksums, I already do it in my code. But
why bio->bi_end_io? I do not get the idea.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] hfsplus: add journal replay
  2014-03-09 22:50   ` Sergei Antonov
@ 2014-03-13 10:20     ` Vyacheslav Dubeyko
  2014-03-13 20:04       ` Sergei Antonov
  0 siblings, 1 reply; 17+ messages in thread
From: Vyacheslav Dubeyko @ 2014-03-13 10:20 UTC (permalink / raw)
  To: Sergei Antonov
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton, Hin-Tak Leung

On Sun, 2014-03-09 at 23:50 +0100, Sergei Antonov wrote:

I am postponing discussion of the small topics to continue the
discussion of two principal topics: (1) structure declarations; (2) the
transaction-based nature of journal replay. It doesn't make sense for me
to discuss minor issues without agreement on the principal topics.

> > My name is not "different author". My name is Vyacheslav Dubeyko. If you
> > wanted to insult me then you have achieved the goal.
> 
> No flame, please :).

You started blaming me. So, simply enjoy the blaming style from my side.
I will answer you as I want.

> I checked your code and made a bug report.

Your report is related to the issue that my code doesn't re-read the
volume header after journal replay. This bug can be fixed easily. I've
already written to you about it.

> Mine is 5 files changed, 1009 insertions(+), 6 deletions(-).
> Yours is 9 files changed, 1559 insertions(+), 39 deletions(-).

I simply repeat below what I've written in my previous e-mail.

My patch contains 1598 lines of code in total. If I exclude: (1) several
big comments - 99 lines, (2) journal initialization functionality - ~192
lines, (3) "norecovery" mount option functionality - ~134 lines, then I
will have about 1173 lines of code in my patchset.

So, 1015 lines of code vs. 1173 lines. That is not such a big
difference. And if you adopt my comments, ~100 lines (I believe that my
comments are really reasonable), and the "norecovery" mount option, ~100
lines of code, then the difference will vanish completely.

> You contradict yourself sometimes :). I remember when you yourself
> suggested changing Apple's name of a structure field. It was when we
> discussed another patch:
> ========
> What about shorter name as "subfolders" for both fields in struct
> hfsplus_inode_info and in struct hfsplus_cat_folder? I suppose it will
> be more informative and correct.
> ========
> Apple's name was "folderCount", my name was after it: "folder_count",
> you suggested a different, but indeed better, "subfolders". In my
> patch I also found several better (IMO) names.

The HFS+ specification (Technical Note TN1150) doesn't define this
field. It simply declares it as a reserved field:

struct HFSPlusCatalogFolder {
    SInt16              recordType;
    UInt16              flags;
    UInt32              valence;
    HFSCatalogNodeID    folderID;
    UInt32              createDate;
    UInt32              contentModDate;
    UInt32              attributeModDate;
    UInt32              accessDate;
    UInt32              backupDate;
    HFSPlusBSDInfo      permissions;
    FolderInfo          userInfo;
    ExtendedFolderInfo  finderInfo;
    UInt32              textEncoding;
    UInt32              reserved;
};

So, in such a situation we have some freedom in field naming, because
the specification doesn't contain any description.

The "folderCount" name contradicts the kernel coding style. I suppose
that "subfolders" is a more sensible and shorter name than
"folder_count". Moreover, you accepted my suggestion.

> > (3) of distorting a structure compared with Technical Note TN1150 (I
> > mean struct block_list_header, for example).
> 
> I'll explain.
> block_list_header is not good because it describes sequential block
> runs, not blocks. To avoid a double meaning of "block" I came up with:
>   struct block_list_header (not really a block is meant) -> struct hfs_jrnl_list
>   struct block_info (not really a block is meant) -> struct hfs_jrnl_run
> Also changed fields:
>   block_info::bnum (now comes the real block!) -> hfs_jrnl_run::first_block (the 1st block of the run)
>   block_info::bsize (size in blocks?) -> hfs_jrnl_run::byte_len (no, it is in bytes! make it clear)

I don't think that you've suggested better names. They are completely
obscure and not informative for me, personally.

And I will always say NACK to your patch while the structures and fields
are not named as described in the HFS+ specification (Technical Note
TN1150), because the specification is the common point for everybody who
tries to understand HFS+ internals and functionality. If your code
doesn't comply with the specification, then everybody will have trouble
understanding the code. Simply write your own file system if you always
want to name things your own way. We have a specification for HFS+.

I prefer to leave mainline untouched rather than to have incorrect or
obscure code for HFS+.

Of course, it is possible to discuss every name, but the common remark
from my side is: (1) your names are longer; (2) your names are obscure.

> > First of all, journal contains sequence of transactions. Technical Note
> > TN1150: "A group of related changes is called a transaction. When all of
> > the changes of a transaction have been written to their normal locations
> > on disk, that transaction has been committed, and is removed from the
> > journal.
> 
> This is about journaling changes, not about replay.
> 

This relates to journal replay as much as to journaling.

> > Secondly, for example, you had a sudden power-off. It means that the
> > last transaction will not be finished.
> 
> You have a poor understanding of the HFS+ journal. After a sudden
> power-off the last transaction is good. An unfinished (not completely
> written) transaction can only be one beyond header->data_end. The
> header is updated only after the transaction is completely on disk.
> This is why the header is 1 sector long: the header update is atomic,
> and the journal is consistent at any time.
> 
> If you've found corrupted transactions, it is not a sudden power-off;
> rather it is your bug :) or a bug in the FS driver that wrote them.
> Wise handling is: cry about it using printk, leave the journal
> unchanged, mount read-only. This is what my code does.

I insist on transaction-based journal replay. I don't think that I have
a poor understanding of journaling as a whole.

There are many possible reasons why a transaction in the journal can be
broken:
(1) You can't guarantee that an initiated flush will succeed. As a
result, the file system driver can think that all is OK with the written
data while the storage really holds troubled data.
(2) Even if the data was written successfully, it doesn't mean that it
will be read successfully, because of various storage troubles. We do
not live in an ideal world.

So, any transaction can potentially be broken. Moreover, you can't
guarantee anything about the last transaction after a sudden power-off.
As a result, it makes sense to replay as many valid transactions as
possible before discovering an invalid one.

> > And, finally, if you first check all transactions, then it means that
> > you need to read all of them into memory before actually replaying.
> > And, as far as I can judge, you simply keep all sectors in memory
> > before replay. For example, your system has 512 MB of RAM and the HFS+
> > volume has a fully filled journal 512 MB in size. How lucky will you be
> > with journal replay on such a system without swap?
> 
> Pretty lucky :) because:
> 1st, I store the data to be written to sectors, not the whole journal.
> 2nd, there is a feature of my code which significantly minimizes memory
> usage: different data in the journal is often destined for the same
> sectors, and I avoid allocating separate memory buffers for it. An
> awesome <linux/interval_tree_generic.h> data structure helps me do it,
> see hfsplus_replay_data_add().
>
> To defend my approach more, I can make measurements of memory
> consumption for a 512 MB journal. Would you like me to?

There are many possible situations that you don't take into account:
(1) The journal can be 1 GB in size, for example, while the system has
512 MB of RAM; and the journal can potentially be much larger.
(2) You can't predict in what environment the system will work and under
what memory pressure an HFS+ volume will be mounted. So, trying to read
all transactions into memory can result in a memory allocation failure
with significant probability. Such an approach is a very dangerous way,
from my point of view.

[TO SUMMARIZE]

My principal points are:
(1) On-disk layout declarations should agree with Technical Note
TN1150;
(2) Journal replay should be transaction based. I insist specifically on
transaction-based replay.

I will always say NACK to your patch while the structures and fields are
not named as described in the HFS+ specification (Technical Note TN1150)
and while journal replay is not transaction based.

I prefer to leave mainline untouched rather than to have incorrect or
obscure code for HFS+.

Thanks,
Vyacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] hfsplus: add journal replay
  2014-03-13 10:20     ` Vyacheslav Dubeyko
@ 2014-03-13 20:04       ` Sergei Antonov
  2014-03-23 23:15         ` A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay) Hin-Tak Leung
  0 siblings, 1 reply; 17+ messages in thread
From: Sergei Antonov @ 2014-03-13 20:04 UTC (permalink / raw)
  To: Vyacheslav Dubeyko
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton, Hin-Tak Leung

On 13 March 2014 11:20, Vyacheslav Dubeyko <slava@dubeyko.com> wrote:

> > I checked your code and made a bugreport.
>
> Your report is related to the issue that my code doesn't re-read the
> volume header after journal replay. This bug can be fixed easily. I've
> already written to you about it.

But you do not reread the VH on any device, and the error occurred only
on a 4K-sector device. OK, thanks for telling me anyway.

> So, in such a situation we have some freedom in field naming, because
> the specification doesn't contain any description.

The spec is not to be contradicted; the reference implementation leaves
some freedom. OK, you've made your point :).

Will you at least not object to renaming block_info::next to
block_info::checksum (i.e. "checksum" instead of "next")? I just want to
know the limits of your esteem for this (dated) spec.

> The "folderCount" name contradicts the kernel coding style. I suppose
> that "subfolders" is a more sensible and shorter name than
> "folder_count". Moreover, you accepted my suggestion.
>
>> > (3) to distort a structure with comparing with Technical Note TN1150 (I
>> > mean struct block_list_header, for example).
>>
>> I'll explain.
>> block_list_header is not good because it describes sequential block
>> runs, not blocks. To avoid a double meaning of "block" I came up with:
>>   struct block_list_header (not really a block is meant) -> struct hfs_jrnl_list
>>   struct block_info (not really a block is meant) -> struct hfs_jrnl_run
>> Also changed fields:
>>   block_info::bnum (now comes the real block!) ->
>> hfs_jrnl_run::first_block (the 1st block of the run)
>>   block_info::bsize (size in blocks?) -> hfs_jrnl_run::byte_len (no,
>> it is in bytes! make it clear)
>
> I don't think that you've suggested better names. They are completely
> obscure and not informative for me, personally.

My reasoning did not have an effect. Oh, well.

By the way, did you notice how elegantly I fixed the
block_list_header::binfo problem, with binfo[0] meaning something
completely different from the other elements? I shifted the first
element's fields into the structure:

struct hfs_jrnl_list {
        u16 reserved;
        u16 count;                      /* number of runs plus 1 */
        u32 length_with_data;           /* length of the list and data */
        u32 checksum;                   /* checksum of the first 32 bytes */
        u32 flags;                      /* see possible flags below */
        u64 reserved1;                  /* unused part of 1st fake run */
        u32 reserved2;                  /* unused part of 1st fake run */
        u32 tr_end_or_seq_num;          /* Before sequence numbers introduction:
                                           zero value means end of transaction.
                                           After sequence numbers introduction:
                                           a non-zero sequence number. */
        struct hfs_jrnl_run runs[0];    /* number of elements is count-1 */
};

Isn't that clearer than the original struct? IMO it is the original
structure that really deserves to be called "obscure and not
informative".

> And I will always say NACK to your patch while the structures and
> fields are not named as described in the HFS+ specification (Technical
> Note TN1150), because the specification is the common point for
> everybody who tries to understand HFS+ internals and functionality. If
> your code doesn't comply with the specification, then everybody will
> have trouble understanding the code. Simply write your own file system
> if you always want to name things your own way. We have a specification
> for HFS+.
>
> I prefer to leave mainline untouched rather than to have incorrect or
> obscure code for HFS+.
>
> Of course, it is possible to discuss every name, but the common remark
> from my side is: (1) your names are longer; (2) your names are obscure.
>
>> > First of all, journal contains sequence of transactions. Technical Note
>> > TN1150: "A group of related changes is called a transaction. When all of
>> > the changes of a transaction have been written to their normal locations
>> > on disk, that transaction has been committed, and is removed from the
>> > journal.
>>
>> This is about journaling changes, not about replay.
>>
>
> This is related as journaling as journal replay.
>
>> > Secondly, for example, you had sudden power-off. It means that last
>> > transaction will be not finished.
>>
>> You have a poor understanding of HFS+ journal. After sudden power-off
>> the last transaction is good. The unfinished (not completely written)
>> transaction may be the one that is beyond header->data_end. The header
>> is updated only after it is completely on the disk. This is why header
>> is 1 sector long - header update is atomic, and the journal is
>> consistent any time.
>>
>> If you've found a corrupted transactions it is not a sudden power-off,
>> rather it is your bug :) or a bug in FS driver that wrote them. Wise
>> handling is: cry about it using printk, leave the journal unchanged,
>> mount for read-only. This is what my code does.
>
> I insist on transaction-based journal replay. I don't think that I have
> a poor understanding of journaling as a whole.
>
> There are many possible reasons why a transaction in the journal can be
> broken:
> (1) You can't guarantee that an initiated flush will succeed. As a
> result, the file system driver can think that all is OK with the
> written data while the storage really holds troubled data.
> (2) Even if the data was written successfully, it doesn't mean that it
> will be read successfully, because of various storage troubles. We do
> not live in an ideal world.

All these rare corruption cases are "cry about it using printk, leave
the journal unchanged".

Ok, what do you think of an implementation that works as follows?

if (a broken transaction is read) {
  1. stop reading journal /* obvious */
  2. replay all good transactions /* they are good and deserve it */
  3. do not update the journal header /* we do not want to deal with it */
  4. mount read-only /* safe behavior for any volume with a journal,
but in this case it is an absolute must */
}

It differs from my current logic in an added step 2.

> So, any transaction can potentially be broken. Moreover, you can't
> guarantee anything about the last transaction after a sudden power-off.
> As a result, it makes sense to replay as many valid transactions as
> possible before discovering an invalid one.
>
>> > And, finally, if you firstly check all transactions then it means that
>> > you need to read all of its in memory before real replaying. And, as far
>> > as I can judge, you simply keep all sectors in memory before replay. For
>> > example, your system has 512 MB of RAM and HFS+ volume has fully filled
>> > journal with 512 MB in size. How lucky will you with journal replay on
>> > such system without a swap?
>>
>> Pretty lucky :) because:
>> 1st I store data to write to sectors, not all journal.
>> 2nd there is a feature of my code which significantly minimizes memory
>> usage. Different data in journal is destined to same sectors, and I
>> avoid allocating separate memory buffers for it. An awesome
>> <linux/interval_tree_generic.h> data structure helps me do it, see
>> hfsplus_replay_data_add().
>>
>> To defend my approach more, I can make measurements of memory
>> consumption for a 512 MB journal. Would you like me to?
>
> There are many possible situations that you don't take into account:
> (1) The journal can be 1 GB in size, for example, while the system has
> 512 MB of RAM; and the journal can potentially be much larger.
> (2) You can't predict in what environment the system will work and
> under what memory pressure an HFS+ volume will be mounted. So, trying
> to read all transactions into memory can result in a memory allocation
> failure with significant probability. Such an approach is a very
> dangerous way, from my point of view.

A threshold, as I already suggested, seems to be a good protection.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-03-13 20:04       ` Sergei Antonov
@ 2014-03-23 23:15         ` Hin-Tak Leung
  2014-03-24  6:56           ` Vyacheslav Dubeyko
  2014-03-25  1:37           ` Sergei Antonov
  0 siblings, 2 replies; 17+ messages in thread
From: Hin-Tak Leung @ 2014-03-23 23:15 UTC (permalink / raw)
  To: Vyacheslav Dubeyko, Sergei Antonov
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton

Hi Vyacheslav and Sergei,

I was hoping not to feed into the issues. I believe the bottom line is that, there are
a few interesting and challenging issues with HFS+ support, and there are simply not
enough people who are both knowledgeable and motivated to improve on it. So we
should try to work together. Does it sound reasonable?

Sergei: - I appreciate that you were the first to point out the endian issue with the journal,
and have been working on other aspects, such as the foldercount issue. I would be happy
to see more work from you, and continue to be willing to have a look and review, where
time permits. However, I do see a few problems with the manner in which you approach
this work. From my personal point of view (and I cannot speak for others), file system
work is data-safety first. So while Vyacheslav's work might be on the verbose side,
it is more reassuring to read, maybe repeatedly, about the details than to skip over boring
details. I don't think length of code should be a criterion, and certainly not "elegance"
or cleverness. Here is a widely quoted saying from Brian Kernighan:
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the
code as cleverly as possible, you are, by definition, not smart enough to debug it."

- Also, if you deviate from how HFS+ works either at the specification level, the behavior level
or the code level, you are just implementing your own journalling work. I would also have
preferred that you point out exactly where Vyacheslav's work went wrong, and supply
a patch on top, instead of starting your own. If you want credit for it, I hope you can
arrange with Vyacheslav to swap the order of signed-off, for one or more of the merged patches,
for example. In any case, while I am happy to see your work, and am willing to review it,
you have not convinced me to try your work out yet. So.

Vyacheslav: Thanks for the continuous good work. While your adherence to TN1150 is
good, I have noticed a few times that you have not paid attention to Apple's
implementation of it - xnu's kernel code and diskdev_cmds (Apple's mkfs/fsck). It is apparent
in some of the discussion between you and Sergei that he has read some of those but
you haven't. This is shown in the discussion about meaning and naming of variables,
and interpretation of reserved fields. About TN1150, I think there is one instance about
journalling support (a block count?) where it is ambiguous and contradictory about a +1 offset.
Anyway, I think references to the actual implementations are welcome in resolving
these ambiguities and clarifying issues.

To the both of you, looking forward, and a few technical issues:

- I think fsck.hfsplus must manipulate the journal - an fs with a non-empty journal is by definition
unclean; but fsck may or may not be doing journal replay. It probably does something else,
like just zeroing the journal and fixing inconsistencies in the rest. At some point we should find
out what fsck does when it fixes an unclean fs with a non-empty journal.

- I have a more reproducible recipe for making disk images with a strange
location of the 2nd volume header using hdiutil (Apple's loopback-mounting
and disk-image manipulation tool). At some point we should revisit this issue.

- On the many-year-old issue of the driver getting confused just running "du" recursing
down large directories, I have managed to make it happen under Ftrace, the Linux
kernel's internal tracer, and even with code modifications, but the resulting log is too
large and events are being dropped. So I need to do some filtering to get
events down to a fully recordable level. I.e. it would help in confirming the issues
if I knew where to look...

Hin-Tak
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-03-23 23:15         ` A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay) Hin-Tak Leung
@ 2014-03-24  6:56           ` Vyacheslav Dubeyko
  2014-03-25  1:37           ` Sergei Antonov
  1 sibling, 0 replies; 17+ messages in thread
From: Vyacheslav Dubeyko @ 2014-03-24  6:56 UTC (permalink / raw)
  To: htl10
  Cc: Sergei Antonov, linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton

Hi Hin-Tak

On Sun, 2014-03-23 at 23:15 +0000, Hin-Tak Leung wrote:
> Hi Vyacheslav and Sergei,
> 
> I was hoping not to feed into the issues. I believe the bottom line is that, there are
> a few interesting and challenging issues with HFS+ support, and there are simply not
> enough people who are both knowledgeable and motivated to improve on it. So we
> should try to work together. Does it sound reasonable?
> 

Thank you for suggestions and sharing your opinion.

Sorry, I've decided to end my implementation activity for
HFS+. I will work on another file system. It is not about emotions: I
really don't have time for useless activity without results. I have many
work duties, and it is really hard to find time only to spend it in a
useless way.

I am open to discussion, but I don't see a discussion in this topic. I
see only blaming and stupid declarations about the superiority of one
patch over another. If I am suggesting really bad patches, then I should
stop doing it, because I really don't have time for useless activity and
meaningless discussions. But I will always disagree with Sergei's patch
in its current state, because I think that my remarks are reasonable.

So, I've decided to spend my time implementing functionality for other
file systems, because there are many interesting directions there. If I
encounter a similar situation in the future, then I will simply stop my
open-source activity completely, because there are many interesting
things in real life.

Thanks,
Vyacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-03-23 23:15         ` A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay) Hin-Tak Leung
  2014-03-24  6:56           ` Vyacheslav Dubeyko
@ 2014-03-25  1:37           ` Sergei Antonov
  2014-03-25  6:30             ` Vyacheslav Dubeyko
  1 sibling, 1 reply; 17+ messages in thread
From: Sergei Antonov @ 2014-03-25  1:37 UTC (permalink / raw)
  To: htl10
  Cc: Vyacheslav Dubeyko, linux-fsdevel, Al Viro, Christoph Hellwig,
	Andrew Morton

> On 24.03.2014 at 00:15, Hin-Tak Leung <htl10@users.sourceforge.net> wrote:
> 
> Hi Vyacheslav and Sergei,
> 
> I was hoping not to feed into the issues. I believe the bottom line is that, there are
> a few interesting and challenging issues with HFS+ support, and there are simply not
> enough people who are both knowledgeable and motivated to improve on it. So we
> should try to work together. Does it sound reasonable?

Yes, it does.

> 
> Sergei: - I appreciate that you were the first to point out the endian issue with the journal,
> and have been working on other aspects, such as the foldercount issue. I would be happy
> to see more work from you, and continue to be willing to have a look and review, where
> time permits. However, I do see a few problems with the manner in which you approach
> this work. From my personal point of view (and I cannot speak for others), file system
> work is data-safety first.

Let me inject a remark here.
A reader of my discussion with Vyacheslav here may get the impression that my code cares less about data safety. That impression is false. To make a long story short: replaying and removing transactions one by one is as safe as replaying them all at once and then removing them altogether. Maybe it is not so for other journal formats; for the HFS+ journal it is.

> So while Vyacheslav's work might be on the verbose side,
> it is more reassuring to read, maybe repeatedly, about the details than to skip over boring
> details. I don't think length of code should be a criterion, and certainly not "elegance"
> or cleverness. Here is a widely quoted saying from Brian Kernighan:
> "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the
> code as cleverly as possible, you are, by definition, not smart enough to debug it."
> 
> - Also, if you deviate from how HFS+ works either at the specification level, the behavior level
> or the code level, you are just implementing your own journalling work.

Sigh. You have been misled. That's probably because claims like "you do not follow this excerpt from the spec" are more impressive and easier to remember than the explanations I then give of why I am not following it.

> I would also have
> preferred that you point out exactly where Vyacheslav's work went wrong, and supply
> a patch on top, instead of starting your own. If you want credit for it, I hope you can
> arrange with Vyacheslav to swap the order of signed-off, for one or more of the merged patches,
> for example. In any case, while I am happy to see your work, and am willing to review it,
> you have not convinced me to try your work out yet. So.

I am ready for further convincing efforts.
Though now that Vyacheslav has stopped replying in the thread, there is no progress. Two compromises have been suggested by me in the thread; I'll recap them here for you or anyone who wasn't following.
1. Since collecting all transactions requires more memory (up to 12 MB in my tests, but theoretically more), I suggested a threshold. Let's say it will be 1 MB. Once collected data reaches one meg, flush it, free the buffers and go on collecting.
2. A very simple suggestion that makes my logic closer to Vyacheslav's one-by-one transaction processing approach. In case we encounter a corrupted transaction, replay the (good) transactions collected so far. A corrupted transaction is a rare thing (caused by cosmic rays!), but partial replay looks sane and is not a big change to the code.

> Vyacheslav: Thanks for the continuous good work. While your adherence to TN1150 is
> good, I have noticed a few times that you have not paid attention to Apple's
> implementation of it - xnu's kernel code and diskdev_cmds (Apple's mkfs/fsck). It is apparent
> in some of the discussion between you and Sergei that he has read some of those but
> you haven't. This is shown in the discussion about meaning and naming of variables,
> and interpretation of reserved fields. About TN1150, I think there is one instance about
> journalling support (a block count?) where it is ambiguous and contradictory about a +1 offset.
> Anyway, I think references to the actual implementations are welcome in resolving
> these ambiguities and clarifying issues.
> 
> To the both of you, looking forward, and a few technical issues:
> 
> - I think fsck.hfsplus must manipulate the journal - an fs with a non-empty journal is by definition
> unclean; but fsck may or may not be doing journal replay. It probably does something else,
> like just zeroing the journal and fixing inconsistencies in the rest. At some point we should find
> out what fsck does when it fixes an unclean fs with a non-empty journal.

Easy. There is no non-empty journal by the time there is something to fix. It is replayed or discarded before the check.

> - I have a more reproducible recipe for making disk images with a strange
> location of the 2nd volume header using hdiutil (Apple's loopback-mounting
> and disk-image manipulation tool). At some point we should revisit this issue.

Well, present your recipe. But! There is only one proper placement for this header. All other placements are improper, and the spec even explicitly gives one example of an improper placement. I can imagine no way to get it wrong. I see no need for any elaborate code for finding the 2nd header.

> - On the many-year-old issue of the driver getting confused just running "du" recursing
> down large directories, I have managed to make it happen under Ftrace, the Linux
> kernel's internal tracer, and even with code modifications, but the resulting log is too
> large and events are being dropped. So I need to do some filtering to get
> events down to a fully recordable level. I.e. it would help in confirming the issues
> if I knew where to look...
> 
> Hin-Tak
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-03-25  1:37           ` Sergei Antonov
@ 2014-03-25  6:30             ` Vyacheslav Dubeyko
  2014-03-28 13:42               ` Sergei Antonov
  0 siblings, 1 reply; 17+ messages in thread
From: Vyacheslav Dubeyko @ 2014-03-25  6:30 UTC (permalink / raw)
  To: Sergei Antonov
  Cc: htl10, linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton

On Tue, 2014-03-25 at 02:37 +0100, Sergei Antonov wrote:

> > I would also have
> > preferred that you point out exactly where Vyacheslav's work went wrong, and supply
> > a patch on top, instead of starting your own. If you want credit for it, I hope you can
> > arrange with Vyacheslav to swap the order of signed-off, for one or more of the merged patches,
> > for example. In any case, while I am happy to see your work, and am willing to review it,
> > you have not convinced me to try your work out yet. So.
> 
> I am ready for further convincing efforts.
> Though now that Vyacheslav has stopped replying in the thread there is no progress. Two compromises have been suggested by me in the thread, I'll recap them here for you or anyone who wasn't following.
> 1. Since collecting all transactions requires more memory (up to 12 MB in my tests, but theoretically more), I suggested a threshold. Let's say it will be 1 MB. Once collected data reaches one meg, flush it, free the buffers and go on collecting.
> 2. A very simple suggestions that makes my logic closer to Vyacheslav's one-by-one transaction processing approach. In case we encounter a corrupted transaction, replay (good) transactions collected so far. Corrupted transaction is a rare thing (caused by cosmic rays!), but partial replay looks sane and not a big change to the code.
> 

It doesn't make sense to me to repeat the same statements again and
again.

I have these principal points:
(1) On-disk layout declarations should live in peace with Technical Note
TN1150;
(2) Journal replay should be transaction based.

I'll never change my opinion about it.

Thanks,
Vyacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-03-25  6:30             ` Vyacheslav Dubeyko
@ 2014-03-28 13:42               ` Sergei Antonov
  2014-03-28 15:09                 ` Vyacheslav Dubeyko
  0 siblings, 1 reply; 17+ messages in thread
From: Sergei Antonov @ 2014-03-28 13:42 UTC (permalink / raw)
  To: Vyacheslav Dubeyko
  Cc: htl10, linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton

On 25 March 2014 07:30, Vyacheslav Dubeyko <slava@dubeyko.com> wrote:
> On Tue, 2014-03-25 at 02:37 +0100, Sergei Antonov wrote:
>
>> > I would also have
>> > preferred that you point out exactly where Vyacheslav's work went wrong, and supply
>> > a patch on top, instead of starting your own. If you want credit for it, I hope you can
>> > arrange with Vyacheslav to swap the order of signed-off, for one or more of the merged patches,
>> > for example. In any case, while I am happy to see your work, and am willing to review it,
>> > you have not convinced me to try your work out yet. So.
>>
>> I am ready for further convincing efforts.
>> Though now that Vyacheslav has stopped replying in the thread, there is no progress. I have suggested two compromises in the thread; I'll recap them here for you or anyone who wasn't following.
>> 1. Since collecting all transactions requires more memory (up to 12 MB in my tests, but theoretically more), I suggested a threshold. Let's say it is 1 MB: once the collected data reaches one megabyte, flush it, free the buffers, and go on collecting.
>> 2. A very simple suggestion that brings my logic closer to Vyacheslav's one-by-one transaction processing approach: if we encounter a corrupted transaction, replay the (good) transactions collected so far. A corrupted transaction is a rare thing (caused by cosmic rays!), but partial replay looks sane and is not a big change to the code.
>>
>
> It doesn't make sense to me to repeat the same statements again and
> again.
>
> I have these principal points:
> (1) On-disk layout declarations should live in peace with Technical Note
> TN1150;
> (2) Journal replay should be transaction based.

Could you clarify "transaction based"? Seriously. Because since there
are transactions in the HFS+ journal, all replay implementations are
transaction-based in a way.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-03-28 13:42               ` Sergei Antonov
@ 2014-03-28 15:09                 ` Vyacheslav Dubeyko
  2014-03-28 20:04                   ` Sergei Antonov
  0 siblings, 1 reply; 17+ messages in thread
From: Vyacheslav Dubeyko @ 2014-03-28 15:09 UTC (permalink / raw)
  To: Sergei Antonov
  Cc: htl10, linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton

On Fri, 2014-03-28 at 14:42 +0100, Sergei Antonov wrote:

> >
> > I have these principal points:
> > (1) On-disk layout declarations should live in peace with Technical Note
> > TN1150;
> > (2) Journal replay should be transaction based.
> 
> Could you clarify "transaction based"? Seriously. Because since there
> are transactions in the HFS+ journal, all replay implementations are
> transaction-based in a way.

I believe I described my vision very clearly in previous e-mails.
Sorry, I don't see any need to repeat the same things.

If you want to ask something, please ask a concrete question. From my
point of view, we had a deep discussion earlier, and its goal was not
to elaborate the theoretical principles of a transaction approach. So
your current question is not concrete enough for me.

If you want to suggest something, please do. But please suggest
something new rather than the same things.

Thanks,
Vyacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-03-28 15:09                 ` Vyacheslav Dubeyko
@ 2014-03-28 20:04                   ` Sergei Antonov
  0 siblings, 0 replies; 17+ messages in thread
From: Sergei Antonov @ 2014-03-28 20:04 UTC (permalink / raw)
  To: Vyacheslav Dubeyko
  Cc: htl10, linux-fsdevel, Al Viro, Christoph Hellwig, Andrew Morton

On 28 March 2014 16:09, Vyacheslav Dubeyko <slava@dubeyko.com> wrote:
> On Fri, 2014-03-28 at 14:42 +0100, Sergei Antonov wrote:
>
>> >
>> > I have these principal points:
>> > (1) On-disk layout declarations should live in peace with Technical Note
>> > TN1150;
>> > (2) Journal replay should be transaction based.
>>
>> Could you clarify "transaction based"? Seriously. Because since there
>> are transactions in the HFS+ journal, all replay implementations are
>> transaction-based in a way.
>
> I believe I described my vision very clearly in previous e-mails.
> Sorry, I don't see any need to repeat the same things.
>
> If you want to ask something, please ask a concrete question. From my
> point of view, we had a deep discussion earlier, and its goal was not
> to elaborate the theoretical principles of a transaction approach. So
> your current question is not concrete enough for me.
>
> If you want to suggest something, please do. But please suggest
> something new rather than the same things.
>
> Thanks,
> Vyacheslav Dubeyko.

My concrete question is: what is your definition of the term? A
definition, you know, is an unambiguous piece of text describing the
term.

Of course you wrote about it before. Let's state (if not reconcile)
our disagreements on the issue in this thread. Of the two principal
points you mentioned, one is clear to me; the other is formulated a
little too vaguely.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-04-04 23:35 Hin-Tak Leung
@ 2014-04-06 20:59 ` Sergei Antonov
  0 siblings, 0 replies; 17+ messages in thread
From: Sergei Antonov @ 2014-04-06 20:59 UTC (permalink / raw)
  To: Hin-Tak Leung
  Cc: Vyacheslav Dubeyko, linux-fsdevel, Al Viro, Christoph Hellwig,
	Andrew Morton

On 5 April 2014 01:35, Hin-Tak Leung <htl10@users.sourceforge.net> wrote:
> Hi Sergei,
>
> ------------------------------
> On Fri, Mar 28, 2014 1:42 PM GMT Sergei Antonov wrote:
>
>>> I am ready for further convincing efforts.
>
> As I said, I am happy to review your work, and I hope you and Vyacheslav can work together.
> However, I wish to point out that Vyacheslav has a track record of contributing to the
> hfs+ driver going back a few years, and I believe a patch was submitted a few months
> ago to make him the maintainer. In the hierarchy of Linux kernel development, informal
> as it is, he is still probably one of the people you need to convince. The way you are
> going about the matter is not contributing to getting your ideas accepted.

This saddens me. I thought I just asked questions.

>>> Though now that Vyacheslav has stopped replying in the thread, there is no progress. I have suggested two compromises in the thread; I'll recap them here for you or anyone who wasn't following.
>>> 1. Since collecting all transactions requires more memory (up to 12 MB in my tests, but theoretically more), I suggested a threshold. Let's say it is 1 MB: once the collected data reaches one megabyte, flush it, free the buffers, and go on collecting.
>>> 2. A very simple suggestion that brings my logic closer to Vyacheslav's one-by-one transaction processing approach: if we encounter a corrupted transaction, replay the (good) transactions collected so far. A corrupted transaction is a rare thing (caused by cosmic rays!), but partial replay looks sane and is not a big change to the code.
>>>
>
> ...
>
>>Could you clarify "transaction based"? Seriously. Because since there
>>are transactions in the HFS+ journal, all replay implementations are
>>transaction-based in a way.
>
> Transactions are what they are. The journal is an ordered list of transactions.
> Each of them has a checksum for integrity, and is treated as one unit of
> change. Theoretically, you should apply one after the other in the correct order,
> until the list is exhausted, or until you encounter the first such unit
> that does not pass sanity checks (the checksum, and other out-of-bounds
> conditions, etc.).
>
> How you go about it, e.g. buffering a number of transactions for performance
> reasons, is up to you (or the implementor), *up to a certain point*. I
> suggest you supply a patch on top of Vyacheslav's patch set to
> demonstrate how you want to go about it. It is rather vague and
> unconstructive to go around in circles about words and the meaning of
> words. So I suggest that you show us the code:
> how you would modify Vyacheslav's work to operate your way.
>
> Be really careful about saying things like "all replay implementations". There is
> no such thing as "all" replay implementations: there is an official specification,
> TN1150, and there is an official implementation, the one inside xnu and diskdev_cmds
> from Apple. Anything else is you making a Sergei's file system that is
> not quite HFS+.
>
> If you read/write the journal qualitatively differently from how Apple does it,
> you are by definition wrong, even if your implementation of journalling
> does it faster and "cleverer" than Apple's.

Apple also first collects the data from all transactions and only then
writes it to disk.
See http://www.opensource.apple.com/source/xnu/xnu-2422.1.72/bsd/vfs/vfs_journal.c
(search for "coalesce").
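
To illustrate the kind of collecting I mean: a rough userspace sketch (names and structure are mine, for illustration only; error handling omitted). Replay writes are coalesced by destination sector, later transactions overwrite earlier images of the same sector, and the buffer would be flushed whenever it crosses the 1 MB threshold from my earlier suggestion.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FLUSH_THRESHOLD (1024 * 1024)   /* suggestion 1: flush at 1 MB */

struct replay_write {
        uint64_t sector;        /* destination sector on the volume */
        void *data;             /* full sector image from the journal */
};

struct replay_buf {
        struct replay_write *w;
        size_t count;
        size_t bytes;           /* payload buffered so far */
        size_t sector_size;     /* fixed for a given volume */
};

/* Later transactions win: an already buffered image of the same
 * sector is overwritten in place instead of being stored twice. */
static void buf_add(struct replay_buf *b, uint64_t sector, const void *img)
{
        for (size_t i = 0; i < b->count; i++) {
                if (b->w[i].sector == sector) {
                        memcpy(b->w[i].data, img, b->sector_size);
                        return;
                }
        }
        b->w = realloc(b->w, (b->count + 1) * sizeof(*b->w));
        b->w[b->count].sector = sector;
        b->w[b->count].data = malloc(b->sector_size);
        memcpy(b->w[b->count].data, img, b->sector_size);
        b->count++;
        b->bytes += b->sector_size;
}

static int buf_should_flush(const struct replay_buf *b)
{
        return b->bytes >= FLUSH_THRESHOLD;
}

After each buf_add() the scanner would test buf_should_flush(), write the collected sectors out (with WRITE_SYNC in the kernel), free the images, and continue reading the journal.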

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
  2014-04-05  1:02 Hin-Tak Leung
@ 2014-04-06 16:01 ` Sergei Antonov
  0 siblings, 0 replies; 17+ messages in thread
From: Sergei Antonov @ 2014-04-06 16:01 UTC (permalink / raw)
  To: Hin-Tak Leung
  Cc: Vyacheslav Dubeyko, linux-fsdevel, Al Viro, Christoph Hellwig,
	Andrew Morton

On 5 April 2014 03:02, Hin-Tak Leung <htl10@users.sourceforge.net> wrote:
>>> - I have a more reproducible recipe for making disk images with a strange
>>> location of the 2nd volume header, using hdiutil (Apple's loopback mounting etc. disk image
>>> manipulation tool). At some point we should revisit this issue.
>>
>>Well, present your recipe. But! There is only one proper placement for this header. All other placements are improper, and the spec even explicitly gives an example of an improper placement. I can imagine no way to get it wrong, and I see no need for any elaborate code for finding the 2nd header.
>>
>
> Apple owns the spec, and owns the rights of interpretation and misinterpretation of it;
> they are not bound by it, and they can freely deviate and change their minds. It
> is not an ISO spec and was not submitted to external bodies for standardization; it
> is just provided as is.
>
> You can say Apple's tool is wrong, but a more pragmatic way of seeing it is that
> they don't have to keep the spec up to date or correct; keeping the spec up to date
> costs somebody's time, and there is no business incentive to
> spend the engineering hours on it. Their tools/implementations
> in software are more authoritative than the words in the spec.
>
> hdiutil is Apple proprietary software. Source code is not available.

So what is the recipe? I'll make a patch if there really is a bug in the driver.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
@ 2014-04-05  1:02 Hin-Tak Leung
  2014-04-06 16:01 ` Sergei Antonov
  0 siblings, 1 reply; 17+ messages in thread
From: Hin-Tak Leung @ 2014-04-05  1:02 UTC (permalink / raw)
  To: saproj; +Cc: slava, linux-fsdevel, viro, hch, akpm

Hi Sergei,

------------------------------
On Tue, Mar 25, 2014 1:37 AM GMT Sergei Antonov wrote:

>Let me inject a remark here.
>A reader of my discussion with Vyacheslav here may get the impression that my code cares less about data safety. This impression is false. To make a long story short: replaying+removing transactions one by one is as safe as replaying them all at once and then removing them altogether. Maybe it is not so for other journal formats, but for the HFS+ journal it is: transactions store complete images of the sectors they touch, so replaying any prefix of them leaves the volume in a state from which the rest can still be applied.
>

I think a reader of your commit message might get the impression that you are more
concerned about performance/elegance/cleverness than correctness.

"Doing A is as safe as doing B" - you still have to pick one way or another.
Or supply a patch to do both, to demonstrate switching between the two.

>
>Sigh. You have been misled. That's probably because claims like "you do not follow this excerpt from the spec" are more impressive and easier to remember than my subsequent explanations of why I am not.
>

You are probably assuming that you will always be around to explain this code to
future people who might want to understand it, and that you are the
only person who will ever work on it. How would you feel about a future person
making a commit to revert or correct your work, for example?

>> - I have a more reproducible recipe for making disk images with a strange
>> location of the 2nd volume header, using hdiutil (Apple's loopback mounting etc. disk image
>> manipulation tool). At some point we should revisit this issue.
>
>Well, present your recipe. But! There is only one proper placement for this header. All other placements are improper, and the spec even explicitly gives an example of an improper placement. I can imagine no way to get it wrong, and I see no need for any elaborate code for finding the 2nd header.
>

Apple owns the spec, and owns the rights of interpretation and misinterpretation of it;
they are not bound by it, and they can freely deviate and change their minds. It
is not an ISO spec and was not submitted to external bodies for standardization; it
is just provided as is.

You can say Apple's tool is wrong, but a more pragmatic way of seeing it is that
they don't have to keep the spec up to date or correct; keeping the spec up to date
costs somebody's time, and there is no business incentive to
spend the engineering hours on it. Their tools/implementations
in software are more authoritative than the words in the spec.

hdiutil is Apple proprietary software. Source code is not available.

Hin-Tak

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay)
@ 2014-04-04 23:35 Hin-Tak Leung
  2014-04-06 20:59 ` Sergei Antonov
  0 siblings, 1 reply; 17+ messages in thread
From: Hin-Tak Leung @ 2014-04-04 23:35 UTC (permalink / raw)
  To: saproj, slava; +Cc: linux-fsdevel, viro, hch, akpm

Hi Sergei,

------------------------------
On Fri, Mar 28, 2014 1:42 PM GMT Sergei Antonov wrote:

>> I am ready for further convincing efforts.

As I said, I am happy to review your work, and I hope you and Vyacheslav can work together.
However, I wish to point out that Vyacheslav has a track record of contributing to the
hfs+ driver going back a few years, and I believe a patch was submitted a few months
ago to make him the maintainer. In the hierarchy of Linux kernel development, informal
as it is, he is still probably one of the people you need to convince. The way you are
going about the matter is not contributing to getting your ideas accepted.

>> Though now that Vyacheslav has stopped replying in the thread, there is no progress. I have suggested two compromises in the thread; I'll recap them here for you or anyone who wasn't following.
>> 1. Since collecting all transactions requires more memory (up to 12 MB in my tests, but theoretically more), I suggested a threshold. Let's say it is 1 MB: once the collected data reaches one megabyte, flush it, free the buffers, and go on collecting.
>> 2. A very simple suggestion that brings my logic closer to Vyacheslav's one-by-one transaction processing approach: if we encounter a corrupted transaction, replay the (good) transactions collected so far. A corrupted transaction is a rare thing (caused by cosmic rays!), but partial replay looks sane and is not a big change to the code.
>>

...

>Could you clarify "transaction based"? Seriously. Because since there
>are transactions in the HFS+ journal, all replay implementations are
>transaction-based in a way.

Transactions are what they are. The journal is an ordered list of transactions.
Each of them has a checksum for integrity, and is treated as one unit of
change. Theoretically, you should apply one after the other in the correct order,
until the list is exhausted, or until you encounter the first such unit
that does not pass sanity checks (the checksum, and other out-of-bounds
conditions, etc.).
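
For concreteness, a minimal sketch of that loop. The structures and helpers are hypothetical; the checksum is the simple rolling one from Apple's vfs_journal.c, if I remember it correctly:

#include <stdint.h>

struct journal;                 /* hypothetical: the journal being replayed */
struct transaction;             /* hypothetical: one decoded transaction */

/* Hypothetical plumbing, declared only to make the loop readable. */
extern int  read_next_transaction(struct journal *jnl, struct transaction **t);
extern int  transaction_sane(const struct transaction *t);
extern int  write_transaction_blocks(const struct transaction *t);
extern void mark_transaction_done(struct journal *jnl, struct transaction *t);

/* Rolling checksum as in xnu's vfs_journal.c, from memory:
 * cksum = (cksum << 8) ^ (cksum + byte), complemented at the end. */
uint32_t jnl_checksum(const uint8_t *p, int len)
{
        uint32_t c = 0;
        for (int i = 0; i < len; i++, p++)
                c = (c << 8) ^ (c + *p);
        return ~c;
}

/* Apply transactions strictly in order; the first unit failing its
 * sanity checks (checksum, offsets, ...) ends the replay, and
 * everything replayed before it still counts. */
int replay_journal(struct journal *jnl)
{
        struct transaction *t;

        while (read_next_transaction(jnl, &t) == 0) {
                if (!transaction_sane(t))
                        break;                  /* first bad unit: stop */
                if (write_transaction_blocks(t) != 0)
                        return -1;              /* I/O error: abort */
                mark_transaction_done(jnl, t);  /* advance journal start */
        }
        return 0;
}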

How you go about it, e.g. buffering a number of transactions for performance
reasons, is up to you (or the implementor), *up to a certain point*. I
suggest you supply a patch on top of Vyacheslav's patch set to
demonstrate how you want to go about it. It is rather vague and
unconstructive to go around in circles about words and the meaning of
words. So I suggest that you show us the code:
how you would modify Vyacheslav's work to operate your way.

Be really careful about saying things like "all replay implementations". There is
no such thing as "all" replay implementations: there is an official specification,
TN1150, and there is an official implementation, the one inside xnu and diskdev_cmds
from Apple. Anything else is you making a Sergei's file system that is
not quite HFS+.

If you read/write the journal qualitatively differently from how Apple does it,
you are by definition wrong, even if your implementation of journalling
does it faster and "cleverer" than Apple's.

Hin-Tak

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-04-06 20:59 UTC | newest]

Thread overview: 17+ messages
2014-02-24 10:05 [PATCH v2] hfsplus: add journal replay Sergei Antonov
2014-02-24 10:21 ` Vyacheslav Dubeyko
2014-03-09 16:36 ` Vyacheslav Dubeyko
2014-03-09 22:50   ` Sergei Antonov
2014-03-13 10:20     ` Vyacheslav Dubeyko
2014-03-13 20:04       ` Sergei Antonov
2014-03-23 23:15         ` A few thoughts on hfsplus dev (Re: [PATCH v2] hfsplus: add journal replay) Hin-Tak Leung
2014-03-24  6:56           ` Vyacheslav Dubeyko
2014-03-25  1:37           ` Sergei Antonov
2014-03-25  6:30             ` Vyacheslav Dubeyko
2014-03-28 13:42               ` Sergei Antonov
2014-03-28 15:09                 ` Vyacheslav Dubeyko
2014-03-28 20:04                   ` Sergei Antonov
2014-04-04 23:35 Hin-Tak Leung
2014-04-06 20:59 ` Sergei Antonov
2014-04-05  1:02 Hin-Tak Leung
2014-04-06 16:01 ` Sergei Antonov
