* [RFC] raid5: add a log device to fix raid5/6 write hole issue
@ 2015-03-30 22:25 Shaohua Li
  2015-04-01  3:47 ` Dan Williams
  2015-04-01 21:53 ` NeilBrown
  0 siblings, 2 replies; 22+ messages in thread
From: Shaohua Li @ 2015-03-30 22:25 UTC (permalink / raw)
  To: neilb, dan.j.williams, linux-raid; +Cc: songliubraving, Kernel-team

This is my attempt to fix the raid5/6 write hole issue. It's not ready for
merge yet; I'm posting it for comments. Any comments and suggestions are
welcome!

Thanks,
Shaohua

We want a complete raid5/6 stack with reliability and high
performance. Currently raid5/6 has 2 issues:

1. read-modify-write for small-size IO. To fix this issue, a cache layer
above raid5/6 can be used to aggregate writes into full-stripe writes.
2. the write hole issue. A write log below raid5/6 can fix it.

We plan to use an SSD to fix both issues. Here we just fix the write
hole issue.

1. We don't try to fix both issues together. A cache layer will do write
acceleration. A log layer will fix the write hole. The separation
simplifies things a lot.

2. The current assumption is that flashcache/bcache will be used as the
cache layer. If they don't work well, we can fix them or add a simple cache
layer for raid write aggregation later. We also assume the cache layer will
absorb writes, so the log doesn't need to worry about write latency.

3. With the log, a write hits the log disk first, then the raid disks, and
finally IO completion is reported. An optimal way would be to report IO
completion as soon as the IO hits the log disk, to cut write latency. But
then the read path would need to query the log disk, which adds complexity.
Since we don't worry about write latency, we choose the simple solution.
This will be revisited if there is a performance issue.

This design isn't intrusive for raid5/6. Actually only very few changes to
the existing code are required.

The log works like jbd. Stripe IO destined for the raid disks is written to
the log disk first in an atomic way. Several stripe IOs make up a
transaction. Once all stripes of a transaction are finished, the transaction
can be checkpointed.

Basic logic of a raid5/6 write will be (see the sketch after this list):
1. normal raid5/6 steps for a stripe (fetch data, calculate parity, etc).
The log hooks into ops_run_io.
2. the stripe is added to a transaction. Write the stripe data to the log
disk (metadata block, stripe data).
3. write the commit block to the log disk.
4. flush the log disk cache.
5. the stripe is now logged and normal stripe handling continues.
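
A simplified sketch of that per-stripe flow in pseudo-C (the real entry
points in the patch are r5log_write_stripe() and r5log_journal_one_stripe();
dirty_data_blocks() below stands in for the data_disks * PAGE_TO_BLOCKS()
calculation, and locking/error handling are omitted):

/* sketch only, not the literal patch code */
static void log_one_stripe(struct r5log_journal *journal,
			   struct stripe_head *sh)
{
	struct r5log_transaction *trans;
	int i;

	/* step 2: attach the stripe to a transaction and queue a metadata
	 * tag plus the data page for every disk we want to write */
	trans = r5log_start_transaction(journal, dirty_data_blocks(sh));
	for (i = 0; i < sh->disks; i++)
		if (test_bit(R5_Wantwrite, &sh->dev[i].flags))
			r5log_add_stripe_page(trans, sh, i);
	list_add_tail(&sh->log_list, &trans->stripe_list);

	/* steps 3-5 happen asynchronously: r5log_do_commit() writes the
	 * commit block, r5log_trans_flush_diskcache() flushes the log disk
	 * cache, and only then are the stripes released so normal raid5/6
	 * handling writes them to the raid disks */
	r5log_end_transaction(trans);
}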

Transaction checkpoint process (see the sketch after this list):
1. all stripes of a transaction are finished
2. flush the disk cache of all raid disks
3. update the log super to reflect the new log checkpoint position
4. write the log super with WRITE_FUA
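
In pseudo-C the checkpoint is roughly the following (the real logic lives in
r5log_do_checkpoint(), r5log_flush_all_disks() and r5log_write_super() below;
transaction freeing, locking and error handling are left out):

/* sketch only: reclaim finished transactions from the tail of the log */
static void checkpoint(struct r5log_journal *journal)
{
	struct r5log_transaction *trans;
	u64 new_tail = journal->last_checkpoint;

	list_for_each_entry(trans, &journal->transaction_list, next_trans) {
		/* step 1: stop at the first transaction whose stripes have
		 * not fully hit the raid disks yet */
		if (trans->state < TRANSACTION_CHECKPOINT_READY)
			break;
		new_tail = trans->log_end;
	}

	r5log_flush_all_disks(journal);		/* step 2 */
	journal->last_checkpoint = new_tail;	/* step 3 */
	r5log_write_super(journal);		/* step 4: WRITE_FUA */
}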

Metadata, data and commit block IO can run at the same time, as checksums
are used to make sure their contents are valid (like jbd2). Log IO doesn't
wait 5s to start like jbd; instead the IO starts every time a metadata block
is full. This cuts some latency.

Disk layout:

|super|metadata|data|metadata| data ... |commitdata|metadata|data| ... |commitdata|
The super, metadata and commit blocks each occupy one block.
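
For reference, log recovery (still missing, see the list below) would
conceptually walk this layout starting from the last checkpoint. A rough
sketch, not part of the patch: read_block(), checksum_ok(),
queue_data_blocks() and replay_to_raid_disks() are hypothetical helpers,
and endianness conversion and ring wrap-around are omitted:

/* sketch only: replay complete transactions found after the checkpoint */
u64 blk = sb->last_checkpoint;
u64 tid = sb->header.transaction_id + 1;

for (;;) {
	struct r5log_meta_header *h = read_block(blk);

	/* a wrong magic, tid, position or checksum marks the end of the log */
	if (h->magic != RAID5_LOG_MAGIC || h->transaction_id != tid ||
	    h->position != blk || !checksum_ok(h))
		break;
	if (h->type == R5LOG_TYPE_META) {
		/* remember the tagged data blocks and skip past them */
		blk = queue_data_blocks(h);
	} else if (h->type == R5LOG_TYPE_COMMIT) {
		/* commit block seen: the transaction is complete */
		replay_to_raid_disks();
		tid++;
		blk++;
	}
}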

This is an initial version, which works, but a lot of pieces are still
missing:
1. error handling
2. log recovery and its impact on raid resync (resync isn't needed anymore)
3. utility changes

The big question is how we designate the log disk. In this patch, I simply
use a spare disk for testing. We need a new raid disk role for the log disk.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/Makefile            |    2 +-
 drivers/md/raid5-log.c         | 1017 ++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c             |   42 +-
 drivers/md/raid5.h             |   16 +
 include/uapi/linux/raid/md_p.h |   64 +++
 5 files changed, 1136 insertions(+), 5 deletions(-)
 create mode 100644 drivers/md/raid5-log.c

diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index a2da532..a0dee4c 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -16,7 +16,7 @@ dm-cache-mq-y   += dm-cache-policy-mq.o
 dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 dm-era-y	+= dm-era-target.o
 md-mod-y	+= md.o bitmap.o
-raid456-y	+= raid5.o
+raid456-y	+= raid5.o raid5-log.o
 
 # Note: link order is important.  All raid personalities
 # and must come before md.o, as they each initialise 
diff --git a/drivers/md/raid5-log.c b/drivers/md/raid5-log.c
new file mode 100644
index 0000000..d27317f
--- /dev/null
+++ b/drivers/md/raid5-log.c
@@ -0,0 +1,1017 @@
+#include <linux/kernel.h>
+#include <linux/wait.h>
+#include <linux/blkdev.h>
+#include <linux/crc32.h>
+#include <linux/raid/md_p.h>
+#include "md.h"
+#include "raid5.h"
+
+struct r5log_journal {
+	struct mddev *mddev;
+	struct md_rdev *rdev;
+	struct page *super_page;
+
+	u64 transaction_id; /* next tid for transaction */
+	u64 checkpoint_transaction_id;
+
+	u32 block_size;
+	u32 block_sector_shift;
+	u64 total_blocks;
+	u64 first_block;
+	u64 last_block;
+
+	u64 last_checkpoint;
+	u64 log_start; /* next transaction starts here */
+
+	u8 data_checksum_type;
+	u8 meta_checksum_type;
+
+	u8 uuid[16];
+
+	struct list_head transaction_list;
+	int transaction_cnt;
+	struct r5log_transaction *current_trans;
+
+	struct list_head pending_stripes;
+	spinlock_t stripes_lock;
+
+	struct md_thread *thread;
+	unsigned long last_commit_time;
+
+	struct mutex jlock;
+
+	int aborted;
+	int do_commit;
+	int do_discard;
+};
+
+struct r5log_io_control {
+	struct r5log_transaction *trans;
+	struct page *meta_page;
+	struct page *commit_page;
+	int tag_index;
+
+	struct bio_list bios;
+	struct bio *current_bio;
+	atomic_t refcnt;
+};
+
+struct r5log_transaction {
+	struct r5log_journal *journal;
+	u64 transaction_id;
+	struct list_head stripe_list;
+	int stripe_cnt;
+	struct r5log_io_control *current_io;
+	atomic_t io_pending;
+	atomic_t stripe_pending;
+	u64 log_start; /* block */
+	u64 log_end; /* block */
+	struct list_head next_trans;
+
+	int state;
+	wait_queue_head_t wait_state;
+
+	struct mutex tlock;
+
+	struct work_struct flush_work;
+};
+
+enum {
+	/* transaction accepts new IO */
+	TRANSACTION_RUNNING = 0,
+	/* transaction is frozen. commit block IO is running */
+	TRANSACTION_COMMITED = 1,
+	/* transaction IO is finished, FLUSH request is running */
+	TRANSACTION_FLUSHING = 2,
+	/*
+	 * transaction FLUSH request is finished, stripes are considered
+	 * on-disk now, those stripes start hitting to raid disks
+	 * */
+	TRANSACTION_STRIPE_RUNNING = 3,
+	/*
+	 * Stripes of transaction hit to raid disks, transaction space can be
+	 * reclaimed
+	 * */
+	TRANSACTION_CHECKPOINT_READY = 4,
+};
+
+#define MAX_TAG_PER_BLOCK(journal) ((journal->block_size - \
+	sizeof(struct r5log_meta_header)) / \
+	sizeof(struct r5log_meta_block_tag))
+#define PAGE_TO_BLOCKS(journal) (PAGE_SIZE / journal->block_size)
+#define BLOCK_TO_SECTOR(b, journal) ((b) << journal->block_sector_shift)
+
+#define TRANSACTION_MAX_SIZE (2 * 1024 * 1024)
+#define TRANSACTION_MAX_STRIPES 128
+#define TRANSACTION_TIMEOUT (5 * HZ)
+
+static void r5log_journal_thread(struct md_thread *thread);
+static int r5log_do_commit(struct r5log_transaction *trans);
+static int r5log_do_checkpoint(struct r5log_journal *journal, u64 blocks);
+
+static u32 r5log_calculate_checksum(struct r5log_journal *journal, u32 crc,
+	void *buf, ssize_t size, bool data)
+{
+	if (journal->data_checksum_type != R5LOG_CHECKSUM_CRC32)
+		BUG();
+	if (journal->meta_checksum_type != R5LOG_CHECKSUM_CRC32)
+		BUG();
+	return crc32_le(crc, buf, size);
+}
+
+static int r5log_read_super(struct r5log_journal *journal)
+{
+	struct md_rdev *rdev = journal->rdev;
+	struct r5log_super_block *sb_blk;
+	struct page *page = journal->super_page;
+	u32 crc = ~0, stored_crc;
+
+	if (!sync_page_io(rdev, 0, PAGE_SIZE, page, READ, false))
+		return -EIO;
+
+	sb_blk = kmap_atomic(page);
+
+	if (le32_to_cpu(sb_blk->version) != RAID5_LOG_VERSION ||
+	    le32_to_cpu(sb_blk->header.magic) != RAID5_LOG_MAGIC ||
+	    le32_to_cpu(sb_blk->header.type) != R5LOG_TYPE_SUPER ||
+	    le64_to_cpu(sb_blk->header.position) != 0)
+		goto error;
+
+	journal->checkpoint_transaction_id =
+		le64_to_cpu(sb_blk->header.transaction_id);
+	journal->transaction_id = journal->checkpoint_transaction_id + 1;
+
+	journal->block_size = le32_to_cpu(sb_blk->block_size);
+	journal->block_sector_shift = ilog2(journal->block_size >> 9);
+
+	/* Only support this stripe size right now */
+	if (le32_to_cpu(sb_blk->stripe_size) != PAGE_SIZE)
+		goto error;
+	if (journal->block_size > PAGE_SIZE)
+		goto error;
+
+	if (sb_blk->meta_checksum_type >= R5LOG_CHECKSUM_NR ||
+	    sb_blk->data_checksum_type >= R5LOG_CHECKSUM_NR)
+		goto error;
+	journal->meta_checksum_type = sb_blk->meta_checksum_type;
+	journal->data_checksum_type = sb_blk->data_checksum_type;
+
+	stored_crc = le32_to_cpu(sb_blk->header.checksum);
+	sb_blk->header.checksum = 0;
+	crc = r5log_calculate_checksum(journal, ~0,
+		sb_blk, journal->block_size, false);
+	crc = r5log_calculate_checksum(journal, crc,
+		sb_blk->uuid, sizeof(sb_blk->uuid), false);
+	if (crc != stored_crc)
+		goto error;
+
+	if (memcmp(journal->uuid, sb_blk->uuid, sizeof(journal->uuid)))
+		goto error;
+
+	journal->first_block = le64_to_cpu(sb_blk->first_block);
+	if (journal->first_block != 1)
+		goto error;
+	journal->total_blocks = le64_to_cpu(sb_blk->total_blocks);
+	journal->last_block = journal->first_block + journal->total_blocks;
+	journal->last_checkpoint = le64_to_cpu(sb_blk->last_checkpoint);
+	kunmap_atomic(sb_blk);
+
+	return 0;
+error:
+	kunmap_atomic(sb_blk);
+	return -EINVAL;
+}
+
+static int r5log_write_super(struct r5log_journal *journal)
+{
+	struct r5log_super_block *sb_blk;
+	u32 crc;
+	struct timespec now = current_kernel_time();
+
+	sb_blk = kmap_atomic(journal->super_page);
+	sb_blk->header.checksum = 0;
+	sb_blk->header.transaction_id =
+		cpu_to_le64(journal->checkpoint_transaction_id);
+	sb_blk->last_checkpoint =
+		cpu_to_le64(journal->last_checkpoint);
+	sb_blk->update_time_sec = cpu_to_le64(now.tv_sec);
+	sb_blk->update_time_nsec = cpu_to_le64(now.tv_nsec);
+
+	crc = r5log_calculate_checksum(journal, ~0,
+		sb_blk, journal->block_size, false);
+	crc = r5log_calculate_checksum(journal, crc,
+		sb_blk->uuid, sizeof(sb_blk->uuid), false);
+	sb_blk->header.checksum = cpu_to_le32(crc);
+	kunmap_atomic(sb_blk);
+
+	if (sync_page_io(journal->rdev, 0, journal->block_size,
+	    journal->super_page, WRITE_FUA, false))
+		return 0;
+	return -EIO;
+}
+
+static int r5log_recover_journal(struct r5log_journal *journal)
+{
+	return 0;
+}
+
+#define DBG 1
+#if DBG
+void r5log_fake_super(struct md_rdev *rdev)
+{
+	struct page *page = alloc_page(GFP_KERNEL|__GFP_ZERO);
+	struct r5log_super_block *sb_blk;
+	u32 crc;
+
+#define BLKSIZE 4096
+	sb_blk = kmap_atomic(page);
+	sb_blk->header.magic = cpu_to_le32(RAID5_LOG_MAGIC);
+	sb_blk->header.type = cpu_to_le32(R5LOG_TYPE_SUPER);
+	sb_blk->header.transaction_id = cpu_to_le64(0x111);
+	sb_blk->version = cpu_to_le32(RAID5_LOG_VERSION);
+	sb_blk->stripe_size = cpu_to_le32(PAGE_SIZE);
+	sb_blk->block_size = cpu_to_le32(BLKSIZE);
+	sb_blk->total_blocks = cpu_to_le64(rdev->sectors * 512 / BLKSIZE);
+	sb_blk->first_block = cpu_to_le64(1);
+	sb_blk->last_checkpoint = 0;
+	sb_blk->meta_checksum_type = R5LOG_CHECKSUM_CRC32;
+	sb_blk->data_checksum_type = R5LOG_CHECKSUM_CRC32;
+	memcpy(sb_blk->uuid, rdev->mddev->uuid, sizeof(sb_blk->uuid));
+
+	crc = crc32_le(~0, sb_blk, BLKSIZE);
+	crc = crc32_le(crc, sb_blk->uuid, sizeof(sb_blk->uuid));
+	sb_blk->header.checksum = cpu_to_le32(crc);
+	kunmap_atomic(sb_blk);
+
+	sync_page_io(rdev, 0, BLKSIZE, page, WRITE, false);
+	__free_page(page);
+}
+#endif
+
+struct r5log_journal *r5log_load_journal(struct md_rdev *rdev)
+{
+	struct r5log_journal *journal;
+
+#if DBG
+	r5log_fake_super(rdev);
+#endif
+
+	journal = kzalloc(sizeof(*journal), GFP_KERNEL);
+	if (!journal)
+		return NULL;
+
+	journal->super_page = alloc_page(GFP_KERNEL);
+	if (!journal->super_page)
+		goto err_page;
+
+	journal->mddev = rdev->mddev;
+	journal->rdev = rdev;
+	memcpy(journal->uuid, rdev->mddev->uuid, sizeof(journal->uuid));
+
+	INIT_LIST_HEAD(&journal->transaction_list);
+	INIT_LIST_HEAD(&journal->pending_stripes);
+	spin_lock_init(&journal->stripes_lock);
+	mutex_init(&journal->jlock);
+
+	if (r5log_read_super(journal))
+		goto err_super;
+
+	journal->do_discard = blk_queue_discard(bdev_get_queue(rdev->bdev));
+
+	if (journal->last_checkpoint != 0) {
+		if (r5log_recover_journal(journal))
+			goto err_super;
+	}
+
+	journal->log_start = 1;
+	journal->last_checkpoint = 1;
+
+	journal->last_commit_time = jiffies;
+	journal->thread = md_register_thread(r5log_journal_thread,
+		journal->mddev, "journal");
+	journal->thread->timeout = TRANSACTION_TIMEOUT;
+
+	return journal;
+err_super:
+	__free_page(journal->super_page);
+err_page:
+	kfree(journal);
+	return NULL;
+}
+
+int r5log_flush_journal(struct r5log_journal *journal)
+{
+	mutex_lock(&journal->jlock);
+	r5log_do_checkpoint(journal, -1);
+	mutex_unlock(&journal->jlock);
+	return 0;
+}
+
+void r5log_free_journal(struct r5log_journal *journal)
+{
+	r5log_flush_journal(journal);
+	md_unregister_thread(&journal->thread);
+
+	__free_page(journal->super_page);
+	kfree(journal);
+}
+
+static u64 r5log_ring_size(struct r5log_journal *journal, u64 start,
+	u64 end)
+{
+	return (journal->total_blocks + end - start) %
+		journal->total_blocks;
+}
+
+static bool r5log_has_room(struct r5log_journal *journal,
+	u64 log_end, int data_blocks)
+{
+	int tags = data_blocks * journal->block_size / PAGE_SIZE;
+	/* data + commit + meta + possible hole */
+	int blocks = data_blocks + 1 +
+		tags / MAX_TAG_PER_BLOCK(journal) + 1 +
+		PAGE_SIZE / journal->block_size - 1;
+
+	return journal->total_blocks - r5log_ring_size(journal,
+		journal->last_checkpoint, log_end)
+		>= blocks + 1;
+}
+
+static int r5log_wait_for_space(struct r5log_transaction *trans,
+	int data_blocks)
+{
+	BUG_ON(!mutex_is_locked(&trans->tlock));
+
+	if (r5log_has_room(trans->journal, trans->log_end, data_blocks))
+		return 0;
+	mutex_unlock(&trans->tlock);
+	r5log_do_checkpoint(trans->journal, data_blocks);
+	return -EAGAIN;
+}
+
+static bool r5log_check_and_freeze_current_transaction(
+	struct r5log_journal *journal)
+{
+	struct r5log_transaction *trans = journal->current_trans;
+
+	BUG_ON(!mutex_is_locked(&journal->jlock));
+
+	/*
+	 * r5log_do_commit doesn't do this because journal->jlock
+	 * isn't held
+	 **/
+	if (trans && trans->state >= TRANSACTION_COMMITED) {
+		journal->log_start = trans->log_end;
+		journal->current_trans = NULL;
+		return true;
+	}
+	return false;
+}
+
+static struct r5log_transaction *
+r5log_start_transaction(struct r5log_journal *journal, int data_blocks)
+{
+	struct r5log_transaction *trans;
+
+	mutex_lock(&journal->jlock);
+	/* FIXME: if transaction_cnt >= xxx, do checkpoint */
+again:
+	trans = journal->current_trans;
+	if (!trans) {
+		trans = kzalloc(sizeof(*trans), GFP_NOIO | __GFP_REPEAT);
+		trans->journal = journal;
+		trans->transaction_id = journal->transaction_id;
+		atomic_set(&trans->io_pending, 1);
+		atomic_set(&trans->stripe_pending, 1);
+		INIT_LIST_HEAD(&trans->stripe_list);
+		trans->log_start = journal->log_start;
+		trans->log_end = journal->log_start;
+		trans->state = TRANSACTION_RUNNING;
+		init_waitqueue_head(&trans->wait_state);
+		mutex_init(&trans->tlock);
+
+		list_add_tail(&trans->next_trans, &journal->transaction_list);
+		journal->transaction_id++;
+		journal->current_trans = trans;
+		journal->transaction_cnt++;
+	}
+	mutex_lock(&trans->tlock);
+
+	if (r5log_check_and_freeze_current_transaction(journal)) {
+		mutex_unlock(&trans->tlock);
+		goto again;
+	}
+
+	if (r5log_wait_for_space(trans, data_blocks))
+		goto again;
+
+	mutex_unlock(&journal->jlock);
+	return trans;
+}
+
+static u64 r5log_transaction_size(struct r5log_transaction *trans)
+{
+	struct r5log_journal *journal = trans->journal;
+
+	return r5log_ring_size(journal, trans->log_start, trans->log_end);
+}
+
+static void r5log_end_transaction(struct r5log_transaction *trans)
+{
+	struct r5log_journal *journal = trans->journal;
+
+	/* trans is big, commit it */
+	if (r5log_transaction_size(trans) * journal->block_size >=
+	    TRANSACTION_MAX_SIZE || trans->stripe_cnt >=
+	    TRANSACTION_MAX_STRIPES)
+		r5log_do_commit(trans);
+	mutex_unlock(&trans->tlock);
+}
+
+static void r5log_transaction_finish_stripe(struct r5log_transaction *trans)
+{
+	if (!atomic_dec_and_test(&trans->stripe_pending))
+		return;
+	trans->state = TRANSACTION_CHECKPOINT_READY;
+	wake_up(&trans->wait_state);
+}
+
+static void r5log_restore_stripe_data(struct stripe_head *sh)
+{
+	int i;
+
+	for (i = 0; i < sh->disks; i++) {
+		void *addr;
+		if (!test_and_clear_bit(R5_Escaped, &sh->dev[i].flags))
+			continue;
+		addr = kmap_atomic(sh->dev[i].page);
+		*(__le32 *)addr = cpu_to_le32(RAID5_LOG_MAGIC);
+		kunmap_atomic(addr);
+	}
+}
+
+static void r5log_trans_flush_diskcache(struct work_struct *work)
+{
+	struct r5log_transaction *trans;
+	struct stripe_head *sh;
+
+	trans = container_of(work, struct r5log_transaction, flush_work);
+	blkdev_issue_flush(trans->journal->rdev->bdev, GFP_NOIO, NULL);
+	trans->state = TRANSACTION_STRIPE_RUNNING;
+	wake_up(&trans->wait_state);
+
+	while (!list_empty(&trans->stripe_list)) {
+		sh = list_first_entry(&trans->stripe_list, struct stripe_head,
+			log_list);
+		list_del_init(&sh->log_list);
+		r5log_restore_stripe_data(sh);
+		atomic_inc(&trans->stripe_pending);
+		set_bit(STRIPE_HANDLE, &sh->state);
+		release_stripe(sh);
+	}
+	r5log_transaction_finish_stripe(trans);
+}
+
+static void r5log_transaction_finish_io(struct r5log_transaction *trans)
+{
+	if (!atomic_dec_and_test(&trans->io_pending))
+		return;
+
+	/*
+	 * FIXME: we can release_stripe before flush disk cache if there is no
+	 * FUA or FLUSH request
+	 * */
+	trans->state = TRANSACTION_FLUSHING;
+	INIT_WORK(&trans->flush_work, r5log_trans_flush_diskcache);
+	schedule_work(&trans->flush_work);
+}
+
+static void r5log_put_io_control(struct r5log_io_control *io)
+{
+	struct r5log_transaction *trans = io->trans;
+
+	if (!atomic_dec_and_test(&io->refcnt))
+		return;
+	__free_page(io->meta_page);
+	if (io->commit_page)
+		__free_page(io->commit_page);
+	kfree(io);
+
+	r5log_transaction_finish_io(trans);
+}
+
+static void r5log_end_io(struct bio *bio, int error)
+{
+	struct r5log_io_control *io = bio->bi_private;
+
+	if (error) {
+		io->trans->journal->aborted = 1;
+		printk(KERN_ERR"r5log IO error\n");
+	}
+	r5log_put_io_control(io);
+	bio_put(bio);
+}
+
+static int r5log_submit_io(struct r5log_transaction *trans)
+{
+	struct r5log_journal *journal = trans->journal;
+	struct r5log_io_control *io = trans->current_io;
+	struct bio *bio;
+	struct r5log_meta_block *meta;
+	u32 crc;
+	u32 tag_flags;
+
+	meta = kmap_atomic(io->meta_page);
+	tag_flags = le32_to_cpu(meta->tags[io->tag_index - 1].flags);
+	tag_flags |= R5LOG_TAG_FLAG_LAST_TAG;
+	meta->tags[io->tag_index - 1].flags = cpu_to_le32(tag_flags);
+
+	crc = r5log_calculate_checksum(journal, ~0, meta,
+		journal->block_size, false);
+	crc = r5log_calculate_checksum(journal, crc, journal->uuid,
+		sizeof(journal->uuid), false);
+	meta->header.checksum = cpu_to_le32(crc);
+	kunmap_atomic(meta);
+
+	while ((bio = bio_list_pop(&io->bios)))
+		submit_bio(WRITE, bio);
+	r5log_put_io_control(io);
+	trans->current_io = NULL;
+	return 0;
+}
+
+int r5log_io_add_page(struct r5log_transaction *trans, struct page *page,
+	ssize_t size)
+{
+	struct r5log_journal *journal = trans->journal;
+	struct r5log_io_control *current_io = trans->current_io;
+	sector_t pos = BLOCK_TO_SECTOR(trans->log_end, journal);
+	struct bio *bio;
+	int blocks = size / journal->block_size;
+
+	/*
+	 * if PAGE_SIZE > block size, there might be one block size hole at the
+	 * tail. Recovery code should be aware of this
+	 * */
+	if (trans->log_end + blocks > journal->last_block) {
+		pos = BLOCK_TO_SECTOR(journal->first_block, journal);
+		goto allocate_bio;
+	}
+
+retry:
+	bio = current_io->current_bio;
+	if (!bio)
+		goto allocate_bio;
+
+	if (!bio_add_page(bio, page, size, 0))
+		goto allocate_bio;
+
+	if (trans->log_end + blocks > journal->last_block)
+		trans->log_end = journal->first_block;
+	trans->log_end += blocks;
+	return 0;
+allocate_bio:
+	current_io->current_bio = NULL;
+	bio = bio_alloc_mddev(GFP_NOIO,
+		MAX_TAG_PER_BLOCK(trans->journal), trans->journal->mddev);
+
+	bio->bi_bdev = journal->rdev->bdev;
+	bio->bi_iter.bi_sector = pos + journal->rdev->data_offset;
+	bio->bi_private = current_io;
+	bio->bi_end_io = r5log_end_io;
+	
+	bio_list_add(&current_io->bios, bio);
+	atomic_inc(&current_io->refcnt);
+	current_io->current_bio = bio;
+	goto retry;
+}
+
+static int r5log_transaction_get_tag(struct r5log_transaction *trans)
+{
+	struct r5log_io_control *current_io;
+	struct r5log_meta_block *meta;
+
+	current_io = trans->current_io;
+	if (current_io && current_io->tag_index >=
+	    MAX_TAG_PER_BLOCK(trans->journal))
+		r5log_submit_io(trans);
+
+	current_io = trans->current_io;
+	if (current_io)
+		return 0;
+
+	current_io = kmalloc(sizeof(*current_io), GFP_NOIO|__GFP_REPEAT);
+	if (!current_io)
+		return -ENOMEM;
+	current_io->meta_page = alloc_page(GFP_NOIO|__GFP_ZERO|__GFP_REPEAT);
+	if (!current_io->meta_page) {
+		kfree(current_io);
+		return -ENOMEM;
+	}
+
+	current_io->trans = trans;
+	current_io->commit_page = NULL;
+	current_io->tag_index = 0;
+	bio_list_init(&current_io->bios);
+	current_io->current_bio = NULL;
+	atomic_set(&current_io->refcnt, 1);
+
+	atomic_inc(&trans->io_pending);
+	trans->current_io = current_io;
+
+	r5log_io_add_page(trans, current_io->meta_page,
+				trans->journal->block_size);
+
+	meta = kmap_atomic(current_io->meta_page);
+	meta->header.magic = cpu_to_le32(RAID5_LOG_MAGIC);
+	meta->header.type = cpu_to_le32(R5LOG_TYPE_META);
+	meta->header.transaction_id =
+			cpu_to_le64(trans->transaction_id);
+	/* we never hit journal->first_block */
+	meta->header.position = cpu_to_le64(trans->log_end - 1);
+	kunmap_atomic(meta);
+	return 0;
+}
+
+int r5log_add_stripe_page(struct r5log_transaction *trans,
+	struct stripe_head *sh, int disk_index)
+{
+	struct r5log_meta_block_tag *tag;
+	struct r5log_meta_block *meta;
+	struct r5dev *rdev;
+	struct r5log_io_control *current_io;
+	u32 crc;
+
+	rdev = &sh->dev[disk_index];
+
+	if (r5log_transaction_get_tag(trans))
+		return -ENOMEM;
+	current_io = trans->current_io;
+
+	crc = r5log_calculate_checksum(trans->journal, rdev->log_checksum,
+		trans->journal->uuid, sizeof(trans->journal->uuid), true);
+	crc = r5log_calculate_checksum(trans->journal, crc,
+		&trans->transaction_id, sizeof(trans->transaction_id), true);
+
+	meta = kmap_atomic(current_io->meta_page);
+	tag = &meta->tags[current_io->tag_index];
+	if (test_bit(R5_Discard, &rdev->flags))
+		tag->flags |= R5LOG_TAG_FLAG_DISCARD;
+	if (test_bit(R5_Escaped, &rdev->flags))
+		tag->flags |= R5LOG_TAG_FLAG_ESCAPED;
+	tag->flags = cpu_to_le32(tag->flags);
+	tag->disk_index = cpu_to_le32(disk_index);
+	tag->disk_sector = cpu_to_le64(sh->sector);
+	kunmap_atomic(meta);
+
+	current_io->tag_index++;
+	r5log_io_add_page(trans, rdev->page, PAGE_SIZE);
+	return 0;
+}
+
+static int r5log_journal_one_stripe(struct r5log_journal *journal,
+	struct stripe_head *sh)
+{
+	struct r5log_transaction *trans;
+	int i;
+	int data_disks = 0;
+
+	for (i = 0; i < sh->disks; i++) {
+		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
+			continue;
+		if (test_bit(R5_Discard, &sh->dev[i].flags))
+			continue;
+		data_disks++;
+	}
+
+	trans = r5log_start_transaction(journal,
+			data_disks * PAGE_TO_BLOCKS(journal));
+	if (!trans)
+		goto abort_trans;
+	for (i = 0; i < sh->disks; i++) {
+		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
+			continue;
+		if (r5log_add_stripe_page(trans, sh, i))
+			goto abort_trans;
+	}
+
+	list_add_tail(&sh->log_list, &trans->stripe_list);
+	trans->stripe_cnt++;
+	sh->log_trans = trans;
+
+	r5log_end_transaction(trans);
+	return 0;
+abort_trans:
+	journal->aborted = 1;
+	printk(KERN_ERR"r5log journal failed\n");
+	r5log_restore_stripe_data(sh);
+	/* skipping the journal is still ok, but we lose the protection */
+	set_bit(STRIPE_HANDLE, &sh->state);
+	release_stripe(sh);
+	return -ENOMEM;
+}
+
+static void r5log_journal_thread(struct md_thread *thread)
+{
+	struct mddev *mddev = thread->mddev;
+	struct r5conf *conf = mddev->private;
+	struct r5log_journal *journal = conf->journal;
+	struct r5log_transaction *trans;
+	struct stripe_head *sh;
+	LIST_HEAD(stripe_list);
+	struct blk_plug plug;
+	bool did_something;
+
+	blk_start_plug(&plug);
+again:
+	did_something = false;
+	spin_lock(&journal->stripes_lock);
+	list_splice_init(&journal->pending_stripes, &stripe_list);
+	spin_unlock(&journal->stripes_lock);
+
+	while (!list_empty(&stripe_list)) {
+		sh = list_first_entry(&stripe_list, struct stripe_head, log_list);
+		list_del_init(&sh->log_list);
+		r5log_journal_one_stripe(journal, sh);
+		did_something = true;
+	}
+
+	if (journal->do_commit || time_after(jiffies,
+	    journal->last_commit_time + TRANSACTION_TIMEOUT)) {
+		mutex_lock(&journal->jlock);
+		r5log_check_and_freeze_current_transaction(journal);
+		trans = journal->current_trans;
+		if (trans)
+			mutex_lock(&trans->tlock);
+		mutex_unlock(&journal->jlock);
+		journal->do_commit = 0;
+		journal->last_commit_time = jiffies;
+
+		if (trans) {
+			r5log_do_commit(trans);
+			mutex_unlock(&trans->tlock);
+
+			did_something = true;
+		}
+	}
+
+	if (did_something)
+		goto again;
+	blk_finish_plug(&plug);
+}
+
+
+int r5log_write_stripe(struct r5log_journal *journal, struct stripe_head *sh)
+{
+	int i;
+	int write_disks = 0;
+
+	/* full stripe write doesn't have a write hole issue */
+	if (sh->log_trans || test_bit(STRIPE_FULL_WRITE, &sh->state))
+		return -EAGAIN;
+	if (journal->aborted)
+		return -ENODEV;
+
+	for (i = 0; i < sh->disks; i++) {
+		void *addr;
+		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
+			continue;
+		write_disks++;
+		if (test_bit(R5_Discard, &sh->dev[i].flags)) {
+			sh->dev[i].log_checksum = ~0;
+			continue;
+		}
+		addr = kmap_atomic(sh->dev[i].page);
+
+		if (*(__le32 *)addr == cpu_to_le32(RAID5_LOG_MAGIC)) {
+			*(u32 *)addr = 0;
+			set_bit(R5_Escaped, &sh->dev[i].flags);
+		}
+		sh->dev[i].log_checksum = r5log_calculate_checksum(journal,
+			~0, addr, PAGE_SIZE, true);
+		kunmap_atomic(addr);
+	}
+	if (!write_disks)
+		return -EINVAL;
+
+	atomic_inc(&sh->count);
+
+	/*
+	 * this function shouldn't wait on journal-related stuff; the journal
+	 * checkpoint might be waiting for stripes to be handled
+	 **/
+	spin_lock(&journal->stripes_lock);
+	list_add_tail(&sh->log_list, &journal->pending_stripes);
+	spin_unlock(&journal->stripes_lock);
+	md_wakeup_thread(journal->thread);
+
+	return 0;
+}
+
+static void r5log_wait_transaction(struct r5log_transaction *trans,
+	int state)
+{
+	wait_event(trans->wait_state, trans->state >= state);
+}
+
+static int r5log_do_commit(struct r5log_transaction *trans)
+{
+	struct r5log_journal *journal = trans->journal;
+	struct r5log_io_control *current_io;
+	struct page *page;
+	struct r5log_commit_block *commit;
+	struct timespec now = current_kernel_time();
+	u32 crc;
+
+	BUG_ON(!mutex_is_locked(&trans->tlock));
+
+	if (trans->state >= TRANSACTION_COMMITED)
+		return 0;
+
+	trans->stripe_cnt = 0;
+
+	journal->last_commit_time = jiffies;
+
+	current_io = trans->current_io;
+	/* The transaction hasn't done anything yet */
+	if (!current_io)
+		BUG();
+
+	page = alloc_page(GFP_NOIO|__GFP_ZERO|__GFP_REPEAT);
+	r5log_io_add_page(trans, page, journal->block_size);
+	current_io->commit_page = page;
+
+	commit = kmap_atomic(page);
+	commit->header.magic = cpu_to_le32(RAID5_LOG_MAGIC);
+	commit->header.type = cpu_to_le32(R5LOG_TYPE_COMMIT);
+	commit->header.transaction_id =
+			cpu_to_le64(trans->transaction_id);
+	commit->commit_sec = cpu_to_le64(now.tv_sec);
+	commit->commit_nsec = cpu_to_le64(now.tv_nsec);
+	commit->header.position = cpu_to_le64(trans->log_end - 1);
+
+	crc = r5log_calculate_checksum(journal, ~0, commit,
+		journal->block_size, false);
+	crc = r5log_calculate_checksum(journal, crc, journal->uuid,
+		sizeof(journal->uuid), false);
+	commit->header.checksum = cpu_to_le32(crc);
+	kunmap_atomic(commit);
+
+	r5log_submit_io(trans);
+	trans->state = TRANSACTION_COMMITED;
+	r5log_transaction_finish_io(trans);
+
+	return 0;
+}
+
+static void r5log_start_commit(struct r5log_journal *journal)
+{
+	journal->do_commit = 1;
+	md_wakeup_thread(journal->thread);
+}
+
+static void r5log_disks_flush_end(struct bio *bio, int err)
+{
+	struct completion *io_complete = bio->bi_private;
+
+	complete(io_complete);
+	bio_put(bio);
+}
+
+static void r5log_flush_all_disks(struct r5log_journal *journal)
+{
+	struct mddev *mddev = journal->mddev;
+	struct bio *bi;
+	DECLARE_COMPLETION_ONSTACK(io_complete);
+
+	bi = bio_alloc_mddev(GFP_NOIO, 0, mddev);
+	bi->bi_end_io = r5log_disks_flush_end;
+	bi->bi_private = &io_complete;
+
+	md_flush_request(mddev, bi);
+
+	wait_for_completion_io(&io_complete);
+}
+
+static void r5log_discard_blocks(struct r5log_journal *journal,
+	u64 start, u64 end)
+{
+	if (!journal->do_discard)
+		return;
+	if (start < end) {
+		blkdev_issue_discard(journal->rdev->bdev,
+			BLOCK_TO_SECTOR(start, journal),
+			BLOCK_TO_SECTOR(end - start, journal),
+			GFP_NOIO, 0);
+	} else {
+		blkdev_issue_discard(journal->rdev->bdev,
+			BLOCK_TO_SECTOR(start, journal),
+			BLOCK_TO_SECTOR(journal->last_block - start, journal),
+			GFP_NOIO, 0);
+		blkdev_issue_discard(journal->rdev->bdev,
+			BLOCK_TO_SECTOR(journal->first_block, journal),
+			BLOCK_TO_SECTOR(end - journal->first_block, journal),
+			GFP_NOIO, 0);
+	}
+}
+
+static int r5log_do_checkpoint(struct r5log_journal *journal,
+	u64 blocks)
+{
+	u64 cp_block = journal->last_checkpoint;
+	u64 cp_tid = journal->checkpoint_transaction_id;
+	struct r5log_transaction *trans;
+	bool enough = false;
+	u64 freed = 0;
+
+	BUG_ON(!mutex_is_locked(&journal->jlock));
+
+	trans = journal->current_trans;
+	if (trans && r5log_ring_size(journal, journal->last_checkpoint,
+	    trans->log_start) < blocks) {
+		mutex_lock(&trans->tlock);
+		r5log_do_commit(trans);
+		mutex_unlock(&trans->tlock);
+	}
+
+	r5log_check_and_freeze_current_transaction(journal);
+
+	while (!list_empty(&journal->transaction_list)) {
+		trans = list_first_entry(&journal->transaction_list,
+			struct r5log_transaction, next_trans);
+
+		if (cp_tid + 1 != trans->transaction_id ||
+		       cp_block != trans->log_start) {
+			BUG();
+		}
+
+		if (!enough)
+			r5log_wait_transaction(trans,
+				TRANSACTION_CHECKPOINT_READY);
+		else if (trans->state < TRANSACTION_CHECKPOINT_READY)
+			break;
+		cp_block = trans->log_end;
+		cp_tid = trans->transaction_id;
+
+		freed += r5log_transaction_size(trans);
+		list_del(&trans->next_trans);
+		kfree(trans);
+
+		journal->transaction_cnt--;
+
+		if (freed > blocks)
+			enough = true;
+	}
+	/* Nothing happened */
+	if (journal->checkpoint_transaction_id == cp_tid)
+		return 0;
+
+	r5log_discard_blocks(journal, journal->last_checkpoint, cp_block);
+
+	r5log_flush_all_disks(journal);
+
+	journal->checkpoint_transaction_id = cp_tid;
+	journal->last_checkpoint = cp_block;
+	/* teardown the journal */
+	if (blocks == -1)
+		journal->last_checkpoint = 0;
+	/* FIXME: trim the range for SSD */
+	r5log_write_super(journal);
+	return 0;
+}
+
+void r5log_stripe_write_finished(struct stripe_head *sh)
+{
+	struct r5log_transaction *trans = sh->log_trans;
+
+	sh->log_trans = NULL;
+	r5log_transaction_finish_stripe(trans);
+}
+
+void r5log_flush_transaction(struct r5log_journal *journal)
+{
+	if (journal->aborted)
+		return;
+	r5log_start_commit(journal);
+}
+
+int r5log_handle_flush_request(struct mddev *mddev, struct bio *bio)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5log_journal *journal = conf->journal;
+
+	if (journal->aborted)
+		return -ENODEV;
+
+	/*
+	 * we flush disk cache and then release_stripe. So if a stripe is
+	 * finished, the disk cache is flushed already, so we don't need to
+	 * flush again
+	 * */
+	if (bio->bi_iter.bi_size == 0) {
+		bio_endio(bio, 0);
+		return 0;
+	}
+	bio->bi_rw &= ~REQ_FLUSH;
+	return -EAGAIN;
+}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b..153cb9c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -409,7 +409,7 @@ static int release_stripe_list(struct r5conf *conf,
 	return count;
 }
 
-static void release_stripe(struct stripe_head *sh)
+void release_stripe(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 	unsigned long flags;
@@ -741,6 +741,10 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	might_sleep();
 
+	if (!sh->log_trans && conf->journal) {
+		if (!r5log_write_stripe(conf->journal, sh))
+			return;
+	}
 	for (i = disks; i--; ) {
 		int rw;
 		int replace_only = 0;
@@ -4111,6 +4115,8 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
+	if (s.return_bi && sh->log_trans)
+		r5log_stripe_write_finished(sh);
 	return_io(s.return_bi);
 
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
@@ -4650,8 +4656,17 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 	bool do_prepare;
 
 	if (unlikely(bi->bi_rw & REQ_FLUSH)) {
-		md_flush_request(mddev, bi);
-		return;
+		int ret = -ENODEV;
+		if (conf->journal) {
+			ret = r5log_handle_flush_request(mddev, bi);
+			if (!ret)
+				return;
+		}
+		if (!conf->journal || ret == -ENODEV) {
+			md_flush_request(mddev, bi);
+			return;
+		}
+		BUG_ON(ret != -EAGAIN);
 	}
 
 	md_write_start(mddev, bi);
@@ -5283,6 +5298,8 @@ static void raid5_do_work(struct work_struct *work)
 	spin_unlock_irq(&conf->device_lock);
 	blk_finish_plug(&plug);
 
+	if (conf->journal)
+		r5log_flush_transaction(conf->journal);
 	pr_debug("--- raid5worker inactive\n");
 }
 
@@ -5354,6 +5371,9 @@ static void raid5d(struct md_thread *thread)
 	async_tx_issue_pending_all();
 	blk_finish_plug(&plug);
 
+	if (conf->journal)
+		r5log_flush_transaction(conf->journal);
+
 	pr_debug("--- raid5d inactive\n");
 }
 
@@ -5740,6 +5760,9 @@ static void raid5_free_percpu(struct r5conf *conf)
 
 static void free_conf(struct r5conf *conf)
 {
+	if (conf->journal)
+		r5log_free_journal(conf->journal);
+
 	free_thread_groups(conf);
 	shrink_stripes(conf);
 	raid5_free_percpu(conf);
@@ -5811,7 +5834,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 {
 	struct r5conf *conf;
 	int raid_disk, memory, max_disks;
-	struct md_rdev *rdev;
+	struct md_rdev *rdev, *spare_rdev = NULL;
 	struct disk_info *disk;
 	char pers_name[6];
 	int i;
@@ -5914,6 +5937,12 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 
 	rdev_for_each(rdev, mddev) {
 		raid_disk = rdev->raid_disk;
+		if (raid_disk < 0) {
+			char b[BDEVNAME_SIZE];
+			spare_rdev = rdev;
+			printk(KERN_INFO "using device %s as log\n",
+				bdevname(rdev->bdev, b));
+		}
 		if (raid_disk >= max_disks
 		    || raid_disk < 0)
 			continue;
@@ -5973,6 +6002,9 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 		goto abort;
 	}
 
+	if (spare_rdev)
+		conf->journal = r5log_load_journal(spare_rdev);
+
 	return conf;
 
  abort:
@@ -6873,6 +6905,8 @@ static void raid5_quiesce(struct mddev *mddev, int state)
 				    lock_all_device_hash_locks_irq(conf));
 		conf->quiesce = 1;
 		unlock_all_device_hash_locks_irq(conf);
+		if (conf->journal)
+			r5log_flush_journal(conf->journal);
 		/* allow reshape to continue */
 		wake_up(&conf->wait_for_overlap);
 		break;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 983e18a..93cfb5a 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -198,6 +198,7 @@ struct stripe_head {
 	struct hlist_node	hash;
 	struct list_head	lru;	      /* inactive_list or handle_list */
 	struct llist_node	release_list;
+	struct list_head	log_list;
 	struct r5conf		*raid_conf;
 	short			generation;	/* increments with every
 						 * reshape */
@@ -215,6 +216,7 @@ struct stripe_head {
 	spinlock_t		stripe_lock;
 	int			cpu;
 	struct r5worker_group	*group;
+	struct r5log_transaction *log_trans;
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -236,6 +238,7 @@ struct stripe_head {
 		struct bio	*toread, *read, *towrite, *written;
 		sector_t	sector;			/* sector of this page */
 		unsigned long	flags;
+		u32 log_checksum;
 	} dev[1]; /* allocated with extra space depending of RAID geometry */
 };
 
@@ -300,6 +303,7 @@ enum r5dev_flags {
 			 */
 	R5_Discard,	/* Discard the stripe */
 	R5_SkipCopy,	/* Don't copy data from bio to stripe cache */
+	R5_Escaped,	/* data is escaped by log */
 };
 
 /*
@@ -495,6 +499,8 @@ struct r5conf {
 	struct r5worker_group	*worker_groups;
 	int			group_cnt;
 	int			worker_cnt_per_group;
+
+	struct r5log_journal	*journal;
 };
 
 /*
@@ -560,4 +566,14 @@ static inline int algorithm_is_DDF(int layout)
 
 extern void md_raid5_kick_device(struct r5conf *conf);
 extern int raid5_set_cache_size(struct mddev *mddev, int size);
+
+void release_stripe(struct stripe_head *sh);
+
+struct r5log_journal *r5log_load_journal(struct md_rdev *rdev);
+int r5log_flush_journal(struct r5log_journal *journal);
+void r5log_free_journal(struct r5log_journal *journal);
+int r5log_write_stripe(struct r5log_journal *journal, struct stripe_head *sh);
+int r5log_handle_flush_request(struct mddev *mddev, struct bio *bio);
+void r5log_stripe_write_finished(struct stripe_head *sh);
+void r5log_flush_transaction(struct r5log_journal *journal);
 #endif
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index 49f4210..84abf0c 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -305,4 +305,68 @@ struct mdp_superblock_1 {
 					|MD_FEATURE_RECOVERY_BITMAP	\
 					)
 
+/* all disk position of below struct start from rdev->start_offset */
+struct r5log_meta_header {
+	__le32 magic;
+	__le32 type;
+	__le32 checksum; /* checksum(metadata block + uuid) */
+	__le32 zero_padding;
+	__le64 transaction_id;
+	__le64 position; /* block number the meta is written */
+} __attribute__ ((__packed__));
+
+#define RAID5_LOG_VERSION 0x1
+#define RAID5_LOG_MAGIC 0x28670308
+
+enum {
+	R5LOG_TYPE_META = 0,
+	R5LOG_TYPE_COMMIT = 1,
+	R5LOG_TYPE_SUPER = 2,
+};
+
+struct r5log_super_block {
+	struct r5log_meta_header header;
+	__le32 version;
+	__le32 stripe_size;
+	__le32 block_size;
+	__le64 total_blocks;
+	__le64 first_block;
+	__le64 last_checkpoint;
+	__le64 update_time_sec;
+	__le64 update_time_nsec;
+	__u8 meta_checksum_type;
+	__u8 data_checksum_type;
+	__u8 uuid[16];
+} __attribute__ ((__packed__));
+
+enum {
+	R5LOG_CHECKSUM_CRC32 = 0,
+	R5LOG_CHECKSUM_NR = 1,
+};
+
+struct r5log_meta_block_tag {
+	__le32 flags;
+	__le32 disk_index;
+	__le32 checksum; /* checksum(data + uuid + transaction_id) */
+	__le32 zero_padding;
+	__le64 disk_sector; /* raid disk sector */
+} __attribute__ ((__packed__));
+
+enum {
+	R5LOG_TAG_FLAG_ESCAPED = 1 << 0, /* data is escaped */
+	R5LOG_TAG_FLAG_LAST_TAG = 1 << 1, /* last tag in the meta block */
+	R5LOG_TAG_FLAG_DISCARD = 1 << 2, /* a discard request, no data */
+};
+
+struct r5log_meta_block {
+	struct r5log_meta_header header;
+	struct r5log_meta_block_tag tags[];
+} __attribute__ ((__packed__));
+
+struct r5log_commit_block {
+	struct r5log_meta_header header;
+	__le64 commit_sec;
+	__le64 commit_nsec;
+} __attribute__ ((__packed__));
+
 #endif
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-03-30 22:25 [RFC] raid5: add a log device to fix raid5/6 write hole issue Shaohua Li
@ 2015-04-01  3:47 ` Dan Williams
  2015-04-01  5:53   ` Shaohua Li
  2015-04-01 18:36   ` Piergiorgio Sartor
  2015-04-01 21:53 ` NeilBrown
  1 sibling, 2 replies; 22+ messages in thread
From: Dan Williams @ 2015-04-01  3:47 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Neil Brown, linux-raid, Song Liu, Kernel-team

On Mon, Mar 30, 2015 at 3:25 PM, Shaohua Li <shli@fb.com> wrote:
> This is my attempt to fix raid5/6 write hole issue, it's not for merge
> yet, I post it out for comments. Any comments and suggestions are
> welcome!
>
> Thanks,
> Shaohua
>
> We expect a completed raid5/6 stack with reliability and high
> performance. Currently raid5/6 has 2 issues:
>
> 1. read-modify-write for small size IO. To fix this issue, a cache layer
> above raid5/6 can be used to aggregate write to full stripe write.
> 2. write hole issue. A write log below raid5/6 can fix the issue.
>
> We plan to use a SSD to fix the two issues. Here we just fix the write
> hole issue.
>
> 1. We don't try to fix the issues together. A cache layer will do write
> acceleration. A log layer will fix write hole. The seperation will
> simplify things a lot.
>
> 2. Current assumption is flashcache/bcache will be used as the cache
> layer. If they don't work well, we can fix them or add a simple cache
> layer for raid write aggregation later. We also assume cache layer will
> absorb write, so log doesn't worry about write latency.

It seems neither bcache nor dm-cache are tackling the write-buffering
problem head on... they still seem to be concerned with some amount of
read caching which I can see as useful for file servers and
workstations, but not necessarily scale out storage.

I'll try to set aside time to take a look at the patch this week.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01  3:47 ` Dan Williams
@ 2015-04-01  5:53   ` Shaohua Li
  2015-04-01  6:02     ` NeilBrown
  2015-04-01 18:36   ` Piergiorgio Sartor
  1 sibling, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-01  5:53 UTC (permalink / raw)
  To: Dan Williams; +Cc: Neil Brown, linux-raid, Song Liu, Kernel-team

On Tue, Mar 31, 2015 at 08:47:04PM -0700, Dan Williams wrote:
> On Mon, Mar 30, 2015 at 3:25 PM, Shaohua Li <shli@fb.com> wrote:
> > This is my attempt to fix raid5/6 write hole issue, it's not for merge
> > yet, I post it out for comments. Any comments and suggestions are
> > welcome!
> >
> > Thanks,
> > Shaohua
> >
> > We expect a completed raid5/6 stack with reliability and high
> > performance. Currently raid5/6 has 2 issues:
> >
> > 1. read-modify-write for small size IO. To fix this issue, a cache layer
> > above raid5/6 can be used to aggregate write to full stripe write.
> > 2. write hole issue. A write log below raid5/6 can fix the issue.
> >
> > We plan to use a SSD to fix the two issues. Here we just fix the write
> > hole issue.
> >
> > 1. We don't try to fix the issues together. A cache layer will do write
> > acceleration. A log layer will fix write hole. The seperation will
> > simplify things a lot.
> >
> > 2. Current assumption is flashcache/bcache will be used as the cache
> > layer. If they don't work well, we can fix them or add a simple cache
> > layer for raid write aggregation later. We also assume cache layer will
> > absorb write, so log doesn't worry about write latency.
> 
> It seems neither bcache nor dm-cache are tackling the write-buffering
> problem head on... they still seem to be concerned with some amount of
> read caching which I can see as useful for file servers and
> workstations, but not necessarily scale out storage.
> 
> I'll try to set aside time to take a look at the patch this week.

Thanks! The cache layer is definitely what I'll focus on next. bcache
supports writeback; I guess we can add an option to skip reading data from
the backing disks for read caching, if that's possible. Another option is
writing a simple cache just for raid5/6 write aggregation. We can
append all data to a log and maintain an index in memory. At raid
shutdown we can flush all data to the raid disks, so the index doesn't need
to be persistent on disk, which makes the cache fairly simple.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01  5:53   ` Shaohua Li
@ 2015-04-01  6:02     ` NeilBrown
  2015-04-01 17:14       ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2015-04-01  6:02 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dan Williams, linux-raid, Song Liu, Kernel-team

[-- Attachment #1: Type: text/plain, Size: 2373 bytes --]

On Tue, 31 Mar 2015 22:53:21 -0700 Shaohua Li <shli@fb.com> wrote:

> On Tue, Mar 31, 2015 at 08:47:04PM -0700, Dan Williams wrote:
> > On Mon, Mar 30, 2015 at 3:25 PM, Shaohua Li <shli@fb.com> wrote:
> > > This is my attempt to fix raid5/6 write hole issue, it's not for merge
> > > yet, I post it out for comments. Any comments and suggestions are
> > > welcome!
> > >
> > > Thanks,
> > > Shaohua
> > >
> > > We expect a completed raid5/6 stack with reliability and high
> > > performance. Currently raid5/6 has 2 issues:
> > >
> > > 1. read-modify-write for small size IO. To fix this issue, a cache layer
> > > above raid5/6 can be used to aggregate write to full stripe write.
> > > 2. write hole issue. A write log below raid5/6 can fix the issue.
> > >
> > > We plan to use a SSD to fix the two issues. Here we just fix the write
> > > hole issue.
> > >
> > > 1. We don't try to fix the issues together. A cache layer will do write
> > > acceleration. A log layer will fix write hole. The seperation will
> > > simplify things a lot.
> > >
> > > 2. Current assumption is flashcache/bcache will be used as the cache
> > > layer. If they don't work well, we can fix them or add a simple cache
> > > layer for raid write aggregation later. We also assume cache layer will
> > > absorb write, so log doesn't worry about write latency.
> > 
> > It seems neither bcache nor dm-cache are tackling the write-buffering
> > problem head on... they still seem to be concerned with some amount of
> > read caching which I can see as useful for file servers and
> > workstations, but not necessarily scale out storage.
> > 
> > I'll try to set aside time to take a look at the patch this week.
> 
> Thanks! The cache layer is definitely what I'll focus on next. bcache
> supports writeback, I guess we can add an option to skip read data from
> backing disks for read caching if it's possible. Another option is
> writting a simple caching just for raid 5/6 write aggregation. We can
> append all data to a log, and maintain an index in memory. At raid
> shutdown, we can flush all data to raid disks, the index doesn't need
> presistent in disk, which makes the caching fairly simple.

Surely if the index doesn't need to persist on disk, then the data doesn't
either, as without the index you cannot find the data...

NeilBrown

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01  6:02     ` NeilBrown
@ 2015-04-01 17:14       ` Shaohua Li
  0 siblings, 0 replies; 22+ messages in thread
From: Shaohua Li @ 2015-04-01 17:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: Dan Williams, linux-raid, Song Liu, Kernel-team

On Wed, Apr 01, 2015 at 05:02:56PM +1100, NeilBrown wrote:
> On Tue, 31 Mar 2015 22:53:21 -0700 Shaohua Li <shli@fb.com> wrote:
> 
> > On Tue, Mar 31, 2015 at 08:47:04PM -0700, Dan Williams wrote:
> > > On Mon, Mar 30, 2015 at 3:25 PM, Shaohua Li <shli@fb.com> wrote:
> > > > This is my attempt to fix raid5/6 write hole issue, it's not for merge
> > > > yet, I post it out for comments. Any comments and suggestions are
> > > > welcome!
> > > >
> > > > Thanks,
> > > > Shaohua
> > > >
> > > > We expect a completed raid5/6 stack with reliability and high
> > > > performance. Currently raid5/6 has 2 issues:
> > > >
> > > > 1. read-modify-write for small size IO. To fix this issue, a cache layer
> > > > above raid5/6 can be used to aggregate write to full stripe write.
> > > > 2. write hole issue. A write log below raid5/6 can fix the issue.
> > > >
> > > > We plan to use a SSD to fix the two issues. Here we just fix the write
> > > > hole issue.
> > > >
> > > > 1. We don't try to fix the issues together. A cache layer will do write
> > > > acceleration. A log layer will fix write hole. The seperation will
> > > > simplify things a lot.
> > > >
> > > > 2. Current assumption is flashcache/bcache will be used as the cache
> > > > layer. If they don't work well, we can fix them or add a simple cache
> > > > layer for raid write aggregation later. We also assume cache layer will
> > > > absorb write, so log doesn't worry about write latency.
> > > 
> > > It seems neither bcache nor dm-cache are tackling the write-buffering
> > > problem head on... they still seem to be concerned with some amount of
> > > read caching which I can see as useful for file servers and
> > > workstations, but not necessarily scale out storage.
> > > 
> > > I'll try to set aside time to take a look at the patch this week.
> > 
> > Thanks! The cache layer is definitely what I'll focus on next. bcache
> > supports writeback, I guess we can add an option to skip read data from
> > backing disks for read caching if it's possible. Another option is
> > writting a simple caching just for raid 5/6 write aggregation. We can
> > append all data to a log, and maintain an index in memory. At raid
> > shutdown, we can flush all data to raid disks, the index doesn't need
> > presistent in disk, which makes the caching fairly simple.
> 
> Surely if the index doesn't need to persist in disk, then the data doesn't
> either, as without the index you cannot find the data...

I mean not just pure data. We can store (disk offset, length, data) tuples
on disk. The index is only used to speed up lookups. If there is a crash,
we can rebuild the index by scanning the tuples.
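
For illustration, a minimal sketch of that layout (all names below are
hypothetical, nothing here exists in the patch):

/* on-disk record, appended to the log; the header alone is enough to
 * rebuild the index after a crash */
struct cache_record {
	__le64 raid_sector;	/* disk offset */
	__le32 length;		/* bytes of data that follow this header */
	__le32 checksum;	/* guards against a torn record */
};

/* in-memory only: maps raid_sector -> where the data sits in the log */
struct cache_index_entry {
	u64 raid_sector;
	u64 log_sector;
	u32 length;
};

On a clean shutdown everything is flushed to the raid disks and the log is
dropped; after a crash the index is rebuilt by scanning the records from the
log tail, so the index itself never has to be written out.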

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01  3:47 ` Dan Williams
  2015-04-01  5:53   ` Shaohua Li
@ 2015-04-01 18:36   ` Piergiorgio Sartor
  2015-04-01 18:46     ` Dan Williams
  2015-04-01 18:46     ` Alireza Haghdoost
  1 sibling, 2 replies; 22+ messages in thread
From: Piergiorgio Sartor @ 2015-04-01 18:36 UTC (permalink / raw)
  To: Dan Williams; +Cc: Shaohua Li, Neil Brown, linux-raid, Song Liu, Kernel-team

On Tue, Mar 31, 2015 at 08:47:04PM -0700, Dan Williams wrote:
> On Mon, Mar 30, 2015 at 3:25 PM, Shaohua Li <shli@fb.com> wrote:
> > This is my attempt to fix raid5/6 write hole issue, it's not for merge
> > yet, I post it out for comments. Any comments and suggestions are
> > welcome!
> >
> > Thanks,
> > Shaohua
> >
> > We expect a completed raid5/6 stack with reliability and high
> > performance. Currently raid5/6 has 2 issues:
> >
> > 1. read-modify-write for small size IO. To fix this issue, a cache layer
> > above raid5/6 can be used to aggregate write to full stripe write.
> > 2. write hole issue. A write log below raid5/6 can fix the issue.
> >
> > We plan to use a SSD to fix the two issues. Here we just fix the write
> > hole issue.
> >
> > 1. We don't try to fix the issues together. A cache layer will do write
> > acceleration. A log layer will fix write hole. The seperation will
> > simplify things a lot.
> >
> > 2. Current assumption is flashcache/bcache will be used as the cache
> > layer. If they don't work well, we can fix them or add a simple cache
> > layer for raid write aggregation later. We also assume cache layer will
> > absorb write, so log doesn't worry about write latency.
> 
> It seems neither bcache nor dm-cache are tackling the write-buffering
> problem head on... they still seem to be concerned with some amount of
> read caching which I can see as useful for file servers and
> workstations, but not necessarily scale out storage.
> 
> I'll try to set aside time to take a look at the patch this week.

There is one thing I do not really get.

The target is to avoid the "write hole", which happens,
for example, when there is a sudden power failure.

Now, how can it be assured, in that case, that the "cache"
device is safe after the power is restored?

Doesn't this solution just shift the problem from
the array to a different device (an SSD, for example)?

Speaking of SSDs, these seem to be quite sensitive to
power failures...

Thanks,

bye,

pg

> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 18:36   ` Piergiorgio Sartor
@ 2015-04-01 18:46     ` Dan Williams
  2015-04-01 20:07       ` Jiang, Dave
  2015-04-01 18:46     ` Alireza Haghdoost
  1 sibling, 1 reply; 22+ messages in thread
From: Dan Williams @ 2015-04-01 18:46 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Shaohua Li, Neil Brown, linux-raid, Song Liu, Kernel-team

On Wed, Apr 1, 2015 at 11:36 AM, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> On Tue, Mar 31, 2015 at 08:47:04PM -0700, Dan Williams wrote:
>> On Mon, Mar 30, 2015 at 3:25 PM, Shaohua Li <shli@fb.com> wrote:
>> > This is my attempt to fix raid5/6 write hole issue, it's not for merge
>> > yet, I post it out for comments. Any comments and suggestions are
>> > welcome!
>> >
>> > Thanks,
>> > Shaohua
>> >
>> > We expect a completed raid5/6 stack with reliability and high
>> > performance. Currently raid5/6 has 2 issues:
>> >
>> > 1. read-modify-write for small size IO. To fix this issue, a cache layer
>> > above raid5/6 can be used to aggregate write to full stripe write.
>> > 2. write hole issue. A write log below raid5/6 can fix the issue.
>> >
>> > We plan to use a SSD to fix the two issues. Here we just fix the write
>> > hole issue.
>> >
>> > 1. We don't try to fix the issues together. A cache layer will do write
>> > acceleration. A log layer will fix write hole. The seperation will
>> > simplify things a lot.
>> >
>> > 2. Current assumption is flashcache/bcache will be used as the cache
>> > layer. If they don't work well, we can fix them or add a simple cache
>> > layer for raid write aggregation later. We also assume cache layer will
>> > absorb write, so log doesn't worry about write latency.
>>
>> It seems neither bcache nor dm-cache are tackling the write-buffering
>> problem head on... they still seem to be concerned with some amount of
>> read caching which I can see as useful for file servers and
>> workstations, but not necessarily scale out storage.
>>
>> I'll try to set aside time to take a look at the patch this week.
>
> There is one thing I do not really get.
>
> The target is to avoid the "write hole", which happens,
> for example, when there is a sudden power failure.
>
> Now, how can be assured, in that case, that the "cache"
> device is safe after the power is restored?

If you lose the cache the data-loss damage is greater, but this has
always been the case with hardware-raid adapters.

> Doesn't this solution just shifts the problem from
> the array to a different device (SSD, for example)?
>
> Speaking of SSD, these are quite "power failure"
> sensitive, it seems...

Simple, if a cache-device is not itself power-failure safe then it
should not be used for power-failure protection.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 18:36   ` Piergiorgio Sartor
  2015-04-01 18:46     ` Dan Williams
@ 2015-04-01 18:46     ` Alireza Haghdoost
  2015-04-01 19:57       ` Wols Lists
  1 sibling, 1 reply; 22+ messages in thread
From: Alireza Haghdoost @ 2015-04-01 18:46 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Dan Williams, Shaohua Li, Neil Brown, linux-raid, Song Liu, Kernel-team

On Wed, Apr 1, 2015 at 1:36 PM, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> On Tue, Mar 31, 2015 at 08:47:04PM -0700, Dan Williams wrote:
>> On Mon, Mar 30, 2015 at 3:25 PM, Shaohua Li <shli@fb.com> wrote:
>> > This is my attempt to fix raid5/6 write hole issue, it's not for merge
>> > yet, I post it out for comments. Any comments and suggestions are
>> > welcome!
>> >
>> > Thanks,
>> > Shaohua
>> >
>> > We expect a completed raid5/6 stack with reliability and high
>> > performance. Currently raid5/6 has 2 issues:
>> >
>> > 1. read-modify-write for small size IO. To fix this issue, a cache layer
>> > above raid5/6 can be used to aggregate write to full stripe write.
>> > 2. write hole issue. A write log below raid5/6 can fix the issue.
>> >
>> > We plan to use a SSD to fix the two issues. Here we just fix the write
>> > hole issue.
>> >
>> > 1. We don't try to fix the issues together. A cache layer will do write
>> > acceleration. A log layer will fix write hole. The seperation will
>> > simplify things a lot.
>> >
>> > 2. Current assumption is flashcache/bcache will be used as the cache
>> > layer. If they don't work well, we can fix them or add a simple cache
>> > layer for raid write aggregation later. We also assume cache layer will
>> > absorb write, so log doesn't worry about write latency.
>>
>> It seems neither bcache nor dm-cache are tackling the write-buffering
>> problem head on... they still seem to be concerned with some amount of
>> read caching which I can see as useful for file servers and
>> workstations, but not necessarily scale out storage.
>>
>> I'll try to set aside time to take a look at the patch this week.
>
> There is one thing I do not really get.
>
> The target is to avoid the "write hole", which happens,
> for example, when there is a sudden power failure.
>
> Now, how can be assured, in that case, that the "cache"
> device is safe after the power is restored?

You do synchronous write-ahead logging on the flash cache. If that
returns successfully, you fire the writes to the RAID. If the system
crashes or fails during the RAID writes (the write hole), you recover
by scanning the write-ahead log on the flash cache and replaying it
into the RAID drives.
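
Roughly, the recovery side would be something like this (a sketch only;
every name here is hypothetical, not code from Shaohua's patch):

/* Sketch of the replay pass described above. */
static void wal_recover(struct wal *log, struct raid *raid)
{
	struct wal_record rec;
	u64 pos = log->checkpoint;

	/* walk forward from the last checkpoint until the log ends or a
	 * torn (checksum-failing) record is found */
	while (wal_read_record(log, pos, &rec) == 0) {
		if (!wal_record_csum_ok(&rec))
			break;
		raid_write(raid, rec.sector, rec.data, rec.len);
		pos = rec.next;
	}
	raid_flush(raid);		/* make the replayed data durable */
	wal_set_checkpoint(log, pos);	/* the log space can now be reused */
}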

> Doesn't this solution just shifts the problem from
> the array to a different device (SSD, for example)?

I don't see such a shift. Enterprise hardware RAID controllers also use
a similar technique to fix the write-hole issue.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 18:46     ` Alireza Haghdoost
@ 2015-04-01 19:57       ` Wols Lists
  2015-04-01 20:04         ` Alireza Haghdoost
  2015-04-01 20:17         ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: Wols Lists @ 2015-04-01 19:57 UTC (permalink / raw)
  To: Alireza Haghdoost, Piergiorgio Sartor
  Cc: Dan Williams, Shaohua Li, Neil Brown, linux-raid, Song Liu, Kernel-team

On 01/04/15 19:46, Alireza Haghdoost wrote:
>> Now, how can be assured, in that case, that the "cache"
>> > device is safe after the power is restored?
> You do sync write-ahead logging on the Flash cache. If it return
> successful, you do fire the writes to the RAID. If system crash/fails
> during the RAID writes (Write-hole), you just recover data by scanning
> write-ahead log in the flash cache and replay the logs into the RAID
> drives.
> 
Just to throw something nasty into the mix, I'm not sure whether it's
SSDs or SD-cards, but there certainly *was* a spate of corrupted
*controllers*.

In other words, a power failure would RELIABLY TRASH the device, if it
happened at the wrong moment. Hopefully that's been fixed ...

Cheers.
Wol

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 19:57       ` Wols Lists
@ 2015-04-01 20:04         ` Alireza Haghdoost
  2015-04-01 20:18           ` Wols Lists
  2015-04-01 20:17         ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Alireza Haghdoost @ 2015-04-01 20:04 UTC (permalink / raw)
  To: Wols Lists
  Cc: Piergiorgio Sartor, Dan Williams, Shaohua Li, Neil Brown,
	linux-raid, Song Liu, Kernel-team

On Wed, Apr 1, 2015 at 2:57 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 01/04/15 19:46, Alireza Haghdoost wrote:
>>> Now, how can be assured, in that case, that the "cache"
>>> > device is safe after the power is restored?
>> You do sync write-ahead logging on the Flash cache. If it return
>> successful, you do fire the writes to the RAID. If system crash/fails
>> during the RAID writes (Write-hole), you just recover data by scanning
>> write-ahead log in the flash cache and replay the logs into the RAID
>> drives.
>>
> Just to throw something nasty into the mix, I'm not sure whether it's
> SSDs or SD-cards, but there certainly *was* a spate of corrupted
> *controllers*.
>
> In other words, a power failure would RELIABLY TRASH the device, if it
> happened at the wrong moment. Hopefully that's been fixed ...
>

That is certainly true. As Dan mentioned, the cache device itself
should be safe against power failure. I agree this is not the case for
every SSD on the market, but it might be the case for Facebook. I hate
to say it, but it seems the usefulness of these efforts depends on what
kind of hardware is deployed as the cache device.

--Alireza

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 18:46     ` Dan Williams
@ 2015-04-01 20:07       ` Jiang, Dave
  0 siblings, 0 replies; 22+ messages in thread
From: Jiang, Dave @ 2015-04-01 20:07 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: linux-raid, neilb, Kernel-team, piergiorgio.sartor, shli, songliubraving


On Wed, 2015-04-01 at 18:46 +0000, Williams, Dan J wrote:
> On Wed, Apr 1, 2015 at 11:36 AM, Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
> > On Tue, Mar 31, 2015 at 08:47:04PM -0700, Dan Williams wrote:
> >> On Mon, Mar 30, 2015 at 3:25 PM, Shaohua Li <shli@fb.com> wrote:
> >> > This is my attempt to fix raid5/6 write hole issue, it's not for merge
> >> > yet, I post it out for comments. Any comments and suggestions are
> >> > welcome!
> >> >
> >> > Thanks,
> >> > Shaohua
> >> >
> >> > We expect a completed raid5/6 stack with reliability and high
> >> > performance. Currently raid5/6 has 2 issues:
> >> >
> >> > 1. read-modify-write for small size IO. To fix this issue, a cache layer
> >> > above raid5/6 can be used to aggregate write to full stripe write.
> >> > 2. write hole issue. A write log below raid5/6 can fix the issue.
> >> >
> >> > We plan to use a SSD to fix the two issues. Here we just fix the write
> >> > hole issue.
> >> >
> >> > 1. We don't try to fix the issues together. A cache layer will do write
> >> > acceleration. A log layer will fix write hole. The seperation will
> >> > simplify things a lot.
> >> >
> >> > 2. Current assumption is flashcache/bcache will be used as the cache
> >> > layer. If they don't work well, we can fix them or add a simple cache
> >> > layer for raid write aggregation later. We also assume cache layer will
> >> > absorb write, so log doesn't worry about write latency.
> >>
> >> It seems neither bcache nor dm-cache are tackling the write-buffering
> >> problem head on... they still seem to be concerned with some amount of
> >> read caching which I can see as useful for file servers and
> >> workstations, but not necessarily scale out storage.
> >>
> >> I'll try to set aside time to take a look at the patch this week.
> >
> > There is one thing I do not really get.
> >
> > The target is to avoid the "write hole", which happens,
> > for example, when there is a sudden power failure.
> >
> > Now, how can be assured, in that case, that the "cache"
> > device is safe after the power is restored?
> 
> If you lose the cache the data-loss damage is greater, but this has
> always been the case with hardware-raid adapters.
> 
> > Doesn't this solution just shifts the problem from
> > the array to a different device (SSD, for example)?
> >
> > Speaking of SSD, these are quite "power failure"
> > sensitive, it seems...
> 
> Simple, if a cache-device is not itself power-failure safe then it
> should not be used for power-failure protection.

I think this would be a good application for some of the newer
technology coming out such as NVDIMM and persistent memory.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 19:57       ` Wols Lists
  2015-04-01 20:04         ` Alireza Haghdoost
@ 2015-04-01 20:17         ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2015-04-01 20:17 UTC (permalink / raw)
  To: Wols Lists, Alireza Haghdoost, Piergiorgio Sartor
  Cc: Dan Williams, Shaohua Li, Neil Brown, linux-raid, Song Liu, Kernel-team

On 04/01/2015 01:57 PM, Wols Lists wrote:
> On 01/04/15 19:46, Alireza Haghdoost wrote:
>>> Now, how can be assured, in that case, that the "cache"
>>>> device is safe after the power is restored?
>> You do sync write-ahead logging on the Flash cache. If it return
>> successful, you do fire the writes to the RAID. If system crash/fails
>> during the RAID writes (Write-hole), you just recover data by scanning
>> write-ahead log in the flash cache and replay the logs into the RAID
>> drives.
>>
> Just to throw something nasty into the mix, I'm not sure whether it's
> SSDs or SD-cards, but there certainly *was* a spate of corrupted
> *controllers*.
>
> In other words, a power failure would RELIABLY TRASH the device, if it
> happened at the wrong moment. Hopefully that's been fixed ...

You can't protect against shitty devices. If you care about power fail 
events, then you use hw that is specifically tested and vetted for that. 
And they do exist. They might just not be the cheapest you can find on 
newegg or similar places.

This potential problem isn't specific to what Shaohua is proposing, nor 
is it a show stopper for that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 20:04         ` Alireza Haghdoost
@ 2015-04-01 20:18           ` Wols Lists
  0 siblings, 0 replies; 22+ messages in thread
From: Wols Lists @ 2015-04-01 20:18 UTC (permalink / raw)
  To: Alireza Haghdoost
  Cc: Piergiorgio Sartor, Dan Williams, Shaohua Li, Neil Brown,
	linux-raid, Song Liu, Kernel-team

On 01/04/15 21:04, Alireza Haghdoost wrote:
> On Wed, Apr 1, 2015 at 2:57 PM, Wols Lists <antlists@youngman.org.uk> wrote:
>> On 01/04/15 19:46, Alireza Haghdoost wrote:
>>>> Now, how can be assured, in that case, that the "cache"
>>>>> device is safe after the power is restored?
>>> You do sync write-ahead logging on the Flash cache. If it return
>>> successful, you do fire the writes to the RAID. If system crash/fails
>>> during the RAID writes (Write-hole), you just recover data by scanning
>>> write-ahead log in the flash cache and replay the logs into the RAID
>>> drives.
>>>
>> Just to throw something nasty into the mix, I'm not sure whether it's
>> SSDs or SD-cards, but there certainly *was* a spate of corrupted
>> *controllers*.
>>
>> In other words, a power failure would RELIABLY TRASH the device, if it
>> happened at the wrong moment. Hopefully that's been fixed ...
>>
> 
> That is certainly true. As Dan mentioned, the cache device it-self
> should be safe against power failure. I agree this is not the case for
> all SSD cards in the market but might be the case for Facebook. I hate
> to say this but It seems these efforts are useful dependent to what
> kind of hardware is deployed for cache device.
> 
It would be nice, but probably not possible, to have some form of
black-list of "these devices are unsafe/dangerous". Along the lines of
"mdadm --probe /dev/sda" or whatever, that gets the device type, checks
it, and says "this SSD can be destroyed by a power failure" or "this is
a cheap disk with the timeout problem" or something. But even if someone
did it, the database would probably bit-rot fairly quickly :-(

Cheers,
Wol


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-03-30 22:25 [RFC] raid5: add a log device to fix raid5/6 write hole issue Shaohua Li
  2015-04-01  3:47 ` Dan Williams
@ 2015-04-01 21:53 ` NeilBrown
  2015-04-01 23:40   ` Shaohua Li
  1 sibling, 1 reply; 22+ messages in thread
From: NeilBrown @ 2015-04-01 21:53 UTC (permalink / raw)
  To: Shaohua Li; +Cc: dan.j.williams, linux-raid, songliubraving, Kernel-team

[-- Attachment #1: Type: text/plain, Size: 5690 bytes --]

On Mon, 30 Mar 2015 15:25:17 -0700 Shaohua Li <shli@fb.com> wrote:

> This is my attempt to fix raid5/6 write hole issue, it's not for merge
> yet, I post it out for comments. Any comments and suggestions are
> welcome!
> 
> Thanks,
> Shaohua
> 
> We expect a completed raid5/6 stack with reliability and high
> performance. Currently raid5/6 has 2 issues:
> 
> 1. read-modify-write for small size IO. To fix this issue, a cache layer
> above raid5/6 can be used to aggregate write to full stripe write.
> 2. write hole issue. A write log below raid5/6 can fix the issue.
> 
> We plan to use a SSD to fix the two issues. Here we just fix the write
> hole issue.
> 
> 1. We don't try to fix the issues together. A cache layer will do write
> acceleration. A log layer will fix write hole. The seperation will
> simplify things a lot.
> 
> 2. Current assumption is flashcache/bcache will be used as the cache
> layer. If they don't work well, we can fix them or add a simple cache
> layer for raid write aggregation later. We also assume cache layer will
> absorb write, so log doesn't worry about write latency.
> 
> 3. For log, write will hit to log disk first, then raid disks, and
> finally IO completion is reported. An optimal way is to report IO
> completion just after IO hits to log disk to cut write latency. But in
> that way, read path need query log disk and increase complexity. And
> since we don't worry about write latency, we choose a simple soltuion.
> This will be revisited if there is performance issue.
> 
> This design isn't intrusive for raid5/6. Actully only very few changes
> of existing code is required.
> 
> Log looks like jbd. Stripe IO to raid disks will be written to log disk
> first in atomic way. Several stripe IO will consist a transaction. If
> all stripes of a transaction are finished, the tranaction can be
> checkpoint.
> 
> Basic logic of raid 5/6 write will be:
> 1. normal raid5/6 steps for a stripe (fetch data, calculate checksum,
> and etc). log hooks to ops_run_io.
> 2. stripe is added to a transaction. Write stripe data to log disk (metadata
> block, stripe data)
> 3. write commit block to log disk
> 4. flush log disk cache.
> 5. stripe is logged now and normal stripe handling continues
> 
> Transaction checkpoint process:
> 1. all stripes of a transaction are finished
> 2. flush disk cache of all raid disks
> 3. change log super to reflect new log checkpoint position
> 4. WRITE_FUA log super
> 
> metadata, data and commit block IO can run in the meaning time, as
> checksum will be used to make sure their data is correct (like jbd2).
> Log IO doesn't wait 5s to start like jbd, instead the IO will start
> every time a metadata block is full. This can cut some latency.
> 
> Disk layout:
> 
> |super|metadata|data|metadata| data ... |commitdata|metadata|data| ... |commitdata|
> super, metadata, commit will use one block
> 
> This is an initial version, which works but a lot of stuffes are
> missing:
> 1. error handling
> 2. log recovery and impact to raid resync (don't need resync anymore)
> 3. utility changes
> 
> The big question is how we report log disk. In this patch, I simply use
> a spare disk for testing. We need a new raid disk role for log disk.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>


Hi,
 thanks for the proposal and the patch which makes it nice and concrete...

 I should start out by saying that I'm not really sold on the importance of
 the issues you are addressing here.
 The "write hole" is certainly of theoretical significance, but I do wonder
 how much practical significance it has.  It can only be a problem if you
 have a system failure and a degraded array at the same time, and both of
 those should be very rare events individually...
 I wonder if anyone has *ever* lost data to the "write hole".

 As for write-ahead caching to reduce latency, most writes from Linux are
 async and so would not benefit from that.  If you do have a heavily
 synchronous write load, then that can be fixed in the filesystem.
 e.g. with ext3 and an external log to a low-latency device you can get
 low-latency writes which largely mask the latency issues introduced by
 RAID5.

 The fact that I'm "not really sold" doesn't mean I am against them ... maybe
 it is just an encouragement for someone to sell them more :-)

 While I understand that keeping the two separate might simplify the
 problem, I'm not at all sure it is a good idea.  It would mean that every
 data block were written three times - once to the write-ahead log, once to
 the write-hole-protection log, and once to the RAID5.

 Your code does avoid write-hole-protection for full-stripe-writes, and this
 would greatly reduce the number of blocks that are written multiple times.
 However I'm not convinced that is correct.
 A reasonable goal is that if the system crashes while writing to a storage
 device, then reads should return either the old data or the new data, and
 nothing else.  A crash in the middle of a full-stripe-write to a degraded
 array could result in some block in the stripe appearing to contain data
 that is different to both the old and the new.  If you are going to close
 the hole, I think it should be done properly.

 A combined log would "simply" involve writing every data block and every
 computed parity block (with index information) to the log device.
 Replaying the log would collect data blocks and flush out those in a stripe
 once the parity block(s) for that stripe became available.

 I think this would actually turn into a fairly simple logging mechanism.

NeilBrown

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 21:53 ` NeilBrown
@ 2015-04-01 23:40   ` Shaohua Li
  2015-04-02  0:19     ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-01 23:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: dan.j.williams, linux-raid, songliubraving, Kernel-team

On Thu, Apr 02, 2015 at 08:53:12AM +1100, NeilBrown wrote:
> On Mon, 30 Mar 2015 15:25:17 -0700 Shaohua Li <shli@fb.com> wrote:
> 
> > This is my attempt to fix raid5/6 write hole issue, it's not for merge
> > yet, I post it out for comments. Any comments and suggestions are
> > welcome!
> > 
> > Thanks,
> > Shaohua
> > 
> > We expect a completed raid5/6 stack with reliability and high
> > performance. Currently raid5/6 has 2 issues:
> > 
> > 1. read-modify-write for small size IO. To fix this issue, a cache layer
> > above raid5/6 can be used to aggregate write to full stripe write.
> > 2. write hole issue. A write log below raid5/6 can fix the issue.
> > 
> > We plan to use a SSD to fix the two issues. Here we just fix the write
> > hole issue.
> > 
> > 1. We don't try to fix the issues together. A cache layer will do write
> > acceleration. A log layer will fix write hole. The seperation will
> > simplify things a lot.
> > 
> > 2. Current assumption is flashcache/bcache will be used as the cache
> > layer. If they don't work well, we can fix them or add a simple cache
> > layer for raid write aggregation later. We also assume cache layer will
> > absorb write, so log doesn't worry about write latency.
> > 
> > 3. For log, write will hit to log disk first, then raid disks, and
> > finally IO completion is reported. An optimal way is to report IO
> > completion just after IO hits to log disk to cut write latency. But in
> > that way, read path need query log disk and increase complexity. And
> > since we don't worry about write latency, we choose a simple soltuion.
> > This will be revisited if there is performance issue.
> > 
> > This design isn't intrusive for raid5/6. Actully only very few changes
> > of existing code is required.
> > 
> > Log looks like jbd. Stripe IO to raid disks will be written to log disk
> > first in atomic way. Several stripe IO will consist a transaction. If
> > all stripes of a transaction are finished, the tranaction can be
> > checkpoint.
> > 
> > Basic logic of raid 5/6 write will be:
> > 1. normal raid5/6 steps for a stripe (fetch data, calculate checksum,
> > and etc). log hooks to ops_run_io.
> > 2. stripe is added to a transaction. Write stripe data to log disk (metadata
> > block, stripe data)
> > 3. write commit block to log disk
> > 4. flush log disk cache.
> > 5. stripe is logged now and normal stripe handling continues
> > 
> > Transaction checkpoint process:
> > 1. all stripes of a transaction are finished
> > 2. flush disk cache of all raid disks
> > 3. change log super to reflect new log checkpoint position
> > 4. WRITE_FUA log super
> > 
> > metadata, data and commit block IO can run in the meaning time, as
> > checksum will be used to make sure their data is correct (like jbd2).
> > Log IO doesn't wait 5s to start like jbd, instead the IO will start
> > every time a metadata block is full. This can cut some latency.
> > 
> > Disk layout:
> > 
> > |super|metadata|data|metadata| data ... |commitdata|metadata|data| ... |commitdata|
> > super, metadata, commit will use one block
> > 
> > This is an initial version, which works but a lot of stuffes are
> > missing:
> > 1. error handling
> > 2. log recovery and impact to raid resync (don't need resync anymore)
> > 3. utility changes
> > 
> > The big question is how we report log disk. In this patch, I simply use
> > a spare disk for testing. We need a new raid disk role for log disk.
> > 
> > Signed-off-by: Shaohua Li <shli@fb.com>
> 
> 
> Hi,
>  thanks for the proposal and the patch which makes it nice and concrete...
> 
>  I should start out by saying that I'm not really sold on the importance of
>  the issues you are addressing here.
>  The "write hole" is certainly of theoretical significance, but I do wonder
>  how much practical significance it has.  It can only be a problem if you
>  have a system failure and a degraded array at the same time, and both of
>  those should be very rare event individually...  
>  I wonder if anyone has *ever* lost data to the "write hole".

We have tens of thousands of machines. A rare event multiplied across
that many machines becomes a normal event :). It's not a significant
issue, but we don't want to take the risk.

>  As for write-ahead caching to reduce latency, most writes from Linux are
>  async and so would not benefit from that.  If you do have a heavily
>  synchronous write load, then that can be fixed in the filesystem.
>  e.g. with ext3 and an external log to a low-latency device you can get
>  low-latency writes which largely mask the latency issues introduced by
>  RAID5.

Maybe I should write more about the caching; I didn't because the patch
is about the write hole issue. Anyway, a side effect of the caching is
that the write-hole-protection log doesn't need to worry about latency.
The main purpose of the caching is to produce full-stripe writes, or to
reduce read-modify-write when a full-stripe write is not possible. The
caching can also reduce hard disk seeks, because we can sort the data
when flushing it from the cache to the raid.

>  The fact that I'm "not really sold" doesn't mean I am against them ... maybe
>  it is just an encouragement for someone to sell them more :-)
> 
>  While I understand that keeping the two separate might simplify the
>  problem, I'm not at all sure it is a good idea.  It would mean that every
>  data block were written three times - once to the write-ahead log, once to
>  the write-hole-protection log, and once to the RAID5.

Yes, if the write-ahead log and the write-hole-protection log are
combined, one write can be avoided.

>  Your code does avoid write-hole-protection for fill-stripe-writes, and this
>  would greatly reduce the  number of block that were written multiple times.
>  However I'm not convinced that is correct.
>  A reasonable  goal is that if the system crashes while writing to a storage
>  device, then reads should return the old data or not new data, not anything
>  else.  A crash in the middle of a full-stripe-write to a degraded array
>  could result in some block in the stripe appearing to contain data that is
>  different to both the old and the new.  If you are going to close the whole,
>  I think it should be done properly.

I can do that simply enough, but I don't think this assumption holds. If
you write to a disk range and there is a failure, nothing guarantees
that you read either the old data or the new data.

> 
>  A combined log would "simply" involve writing every data block and  every
>  compute parity block (with index information) to the log device.
>  Replaying the log would collect data blocks and flush out those in a stripe
>  once the parity block(s) for that stripe became available.
> 
>  I think this would actually turn into a fairly simple logging mechanism.

It's not simple at all. It's unlikely that we write data and parity
contiguously on disk and at the same time. That will make log
checkpointing fairly complex.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-01 23:40   ` Shaohua Li
@ 2015-04-02  0:19     ` NeilBrown
  2015-04-02  4:07       ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2015-04-02  0:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: dan.j.williams, linux-raid, songliubraving, Kernel-team

[-- Attachment #1: Type: text/plain, Size: 3985 bytes --]

On Wed, 1 Apr 2015 16:40:57 -0700 Shaohua Li <shli@fb.com> wrote:


> >  Your code does avoid write-hole-protection for fill-stripe-writes, and this
> >  would greatly reduce the  number of block that were written multiple times.
> >  However I'm not convinced that is correct.
> >  A reasonable  goal is that if the system crashes while writing to a storage
> >  device, then reads should return the old data or not new data, not anything
> >  else.  A crash in the middle of a full-stripe-write to a degraded array
> >  could result in some block in the stripe appearing to contain data that is
> >  different to both the old and the new.  If you are going to close the whole,
> >  I think it should be done properly.
> 
> I can do it simpley. But don't think this assumption is true. If you
> write to a disk range and there is failure, there is nothing guarantee
> you can either read old data or new data.

If you write a range of blocks to a normal disk and crash during the write,
each block will contain either the old data or the new data.
If you write a range to a degraded RAID5 and crash during the write, you
cannot make that same guarantee.
I don't know how important this is, but then I don't really know how
important any of this is.

> 
> > 
> >  A combined log would "simply" involve writing every data block and  every
> >  compute parity block (with index information) to the log device.
> >  Replaying the log would collect data blocks and flush out those in a stripe
> >  once the parity block(s) for that stripe became available.
> > 
> >  I think this would actually turn into a fairly simple logging mechanism.
> 
> It's not simple at all. It's unlikely we write data and parity
> continuously in disk and in the same time. This will make log checkpoint
> fairly complex.

I don't see any cause for complexity.  Let me be more explicit.

I imagine that all data remains in the stripe cache, in memory, until it is
finally written to the RAID5.  So the stripe cache will need to be quite a
bit bigger.

Every time we get a block that we want to write, either a new data block or
a computed parity block, we queue it to the log.

The log works like this:
 - take the first (e.g.) 256 blocks in the queue, create a header to describe
   them, write the header with FUA, then write all the data blocks.  If there
   are fewer than 256, just write what we have.
 - when the header write completes, all blocks written *previously* are now
   safe and we can call bio_end_io on data or unlock the stripe for parity.
 - loop back and write some more blocks.  If there are no blocks to write,
   write a header which describes an empty set of blocks, and wait for more
   blocks to appear.


Each stripe_head needs to track (roughly) where the relevant blocks were
written so it can release them when the stripe is written.
I would conceptually divide the log into 32 regions and keep a 32-bit mask
with each stripe.  When a block is assigned to a region in the log, the
relevant bit is set for the stripe, and a per-region counter is incremented.
When a stripe completes its write, its bits are cleared and the counters for
those regions are decremented.  The log cannot progress into a region which
has a non-zero counter.

We choose the size of transactions so that the first block of each region is
a header block.  These contain a magic number, a sequence number, and a
checksum together with the addresses of the data/parity blocks.  On restart
we read all 32 of these to find out where the log starts and ends.  Then we
replay all the blocks into the stripe cache - discarding any that don't come
with the required parity blocks.
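
In code, the bookkeeping might look roughly like this (a sketch only; the
structure and all names are hypothetical, and 'log_regions' is an assumed
new 32-bit field in stripe_head):

#include <linux/atomic.h>
#include <linux/bitops.h>
#include "raid5.h"	/* struct stripe_head; 'log_regions' field assumed */

#define LOG_REGIONS 32

struct r5log {
	sector_t region_sectors;	/* log size / LOG_REGIONS */
	atomic_t busy[LOG_REGIONS];	/* stripes still parked in each region */
};

/* a data/parity block of 'sh' has been queued to the log at log_sector */
static void r5log_block_queued(struct r5log *log, struct stripe_head *sh,
			       sector_t log_sector)
{
	int r = log_sector / log->region_sectors;

	if (!(sh->log_regions & (1u << r))) {
		sh->log_regions |= 1u << r;
		atomic_inc(&log->busy[r]);
	}
}

/* the stripe has been fully written to the RAID disks */
static void r5log_stripe_done(struct r5log *log, struct stripe_head *sh)
{
	while (sh->log_regions) {
		int r = __ffs(sh->log_regions);

		sh->log_regions &= ~(1u << r);
		atomic_dec(&log->busy[r]);
	}
}

/* the log head may only advance into a region with a zero counter */
static bool r5log_region_free(struct r5log *log, int r)
{
	return atomic_read(&log->busy[r]) == 0;
}

The 32-bit mask keeps the per-stripe cost to one word, and the counters make
the "cannot advance into a busy region" check trivial.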

So it is a very simple log which is never read except on crash recovery.  It
commits everything ASAP so that the writeout to the array can be lazy and can
gather related blocks, sort addresses, etc., with no impact on filesystem
latency.

Does that make sense?

NeilBrown

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-02  0:19     ` NeilBrown
@ 2015-04-02  4:07       ` Shaohua Li
  2015-04-09  0:43         ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-02  4:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: dan.j.williams, linux-raid, songliubraving, Kernel-team

On Thu, Apr 02, 2015 at 11:19:41AM +1100, NeilBrown wrote:
> On Wed, 1 Apr 2015 16:40:57 -0700 Shaohua Li <shli@fb.com> wrote:
> 
> 
> > >  Your code does avoid write-hole-protection for fill-stripe-writes, and this
> > >  would greatly reduce the  number of block that were written multiple times.
> > >  However I'm not convinced that is correct.
> > >  A reasonable  goal is that if the system crashes while writing to a storage
> > >  device, then reads should return the old data or not new data, not anything
> > >  else.  A crash in the middle of a full-stripe-write to a degraded array
> > >  could result in some block in the stripe appearing to contain data that is
> > >  different to both the old and the new.  If you are going to close the whole,
> > >  I think it should be done properly.
> > 
> > I can do it simpley. But don't think this assumption is true. If you
> > write to a disk range and there is failure, there is nothing guarantee
> > you can either read old data or new data.
> 
> If you write a range of blocks to a normal disk and crash during the write,
> each block will contain either the old data or the new data.
> If you write a range to a degraded RAID5 and crash during the write, you
> cannot make that same guarantee.
> I don't know how important this is, but then I don't really know how
> important any of this is.
> 
> > 
> > > 
> > >  A combined log would "simply" involve writing every data block and  every
> > >  compute parity block (with index information) to the log device.
> > >  Replaying the log would collect data blocks and flush out those in a stripe
> > >  once the parity block(s) for that stripe became available.
> > > 
> > >  I think this would actually turn into a fairly simple logging mechanism.
> > 
> > It's not simple at all. It's unlikely we write data and parity
> > continuously in disk and in the same time. This will make log checkpoint
> > fairly complex.
> 
> I don't see any cause for complexity.  Let me be more explicit.
> 
> I imagine that all data remains in the stripe cache, in memory, until it is
> finally written to the RAID5.  So the stripe cache will need to be quite a
> bit bigger.
> 
> Every time we get a block that we want to write, either a new data block or a
> a computed parity block, we queue it to the log.
> 
> The log works like this:
>  - take the first (e.g.) 256 blocks in the queue, create a header to describe
>    them, write the header with FUA, then write all the data blocks.  If there
>    are fewer than 256, just write what we have.
>  - when the header write completes, all blocks written *previously* are now
>    safe and we can call bio_end_io on data or unlock the stripe for parity.
>  - loop back and write some more blocks.  If there are no blocks to write,
>    write a header which describes an empty set of blocks, and wait for more
>    blocks to appear.

Ok, this is similar.

> Each stripe_head needs to track (roughly) where the relevant blocks were
> written so it can release them when the stripe is written.
> I would conceptually divide the log into 32 regions and keep a 32bit number
> with each stripe.  When a block is assigned to a region in the log, the
> relevant bit is set for the stripe, and a per-region counter is incremented.
> When a stripe completes its write, the region counters for all the bits are 
> cleared.  The log cannot progress into a region which has a non-zero counter.

I like this region idea very much. Previously I thought the combined log
was complex because data and parity are not in adjacent disk locations,
which causes fragmentation and makes checkpointing complex. The region
idea effectively solves that problem, but a large region would still have
the fragmentation issue. We can divide the disk into a lot of equal-sized
regions; the region size could be 4k*raid_disks*2, for example. Each
region is then a log for exactly one stripe. A write appends data to such
a log and the write is finished. Parity is appended to the log too, and
then the region is considered settled. The downside is that the metadata
will use one sector even when just a few bytes are required, and this
will produce a lot of small IOs too.
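
For example, the sizing/addressing would roughly be (a sketch with
hypothetical names; the *2 factor is taken from the example above):

#include <linux/types.h>

/* region size from the example above: 4k * raid_disks * 2, in sectors */
static sector_t region_sectors(int raid_disks)
{
	return (sector_t)raid_disks * 2 * (4096 >> 9);
}

/* start of region 'idx' in the log; which stripe the region currently
 * logs would be recorded in the region's metadata sector */
static sector_t region_start(sector_t log_data_start, int raid_disks, u64 idx)
{
	return log_data_start + idx * region_sectors(raid_disks);
}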

I'm not enthusiastic about using the stripe cache though; we can't keep
all the data in the stripe cache. What we really need is an index.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-02  4:07       ` Shaohua Li
@ 2015-04-09  0:43         ` Shaohua Li
  2015-04-09  5:04           ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-09  0:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: dan.j.williams, linux-raid, songliubraving, Kernel-team

Hi,
This is what I'm working on now; hopefully I'll have the basic code
running next week. The new design will do caching and fix the write hole
issue too. Before I post the code out, I'd like to check whether the
design has obvious issues.

Thanks,
Shaohua

The main goal is to aggregate write IO, hopefully into full-stripe IO, and to
fix the write hole issue. This might speed up reads too, but it's not
optimized for reads; e.g. we don't proactively cache data for reads. The
aggregation makes a lot of sense for workloads which sequentially write to
several files. Such workloads are popular in today's datacenters.

Here cache = cache disk, generally SSD. raid = raid array or raid disks
(excluding cache disk)
-------------------------
cache layout will like this:

|super|chunk descriptor|chunk data|

We divide the cache into equal-sized chunks; each chunk will have a
descriptor. A chunk's size will be raid_chunk_size * raid_disks, i.e. a cache
chunk can store a whole raid chunk's data and parity.

Write IO goes to cache chunks first and is then flushed to raid chunks. We use
fixed-size chunks because:
-managing cache space is easy; we don't need a complex tree-like index

-flushing data from cache to raid is easy; data and parity are in the same chunk

-reclaiming space is easy. When there is no free chunk in the cache, we must
free some chunks, i.e. reclaim. We reclaim in chunk units; reclaiming a chunk
just means flushing it from cache to raid. If we used a more complex data
structure, we would need garbage collection and so on.

-The downside is that we waste space. E.g. a single 4k write will use a whole
chunk in the cache. But we can reclaim chunks with low utilization quickly to
partially mitigate this issue.

--------------------
chunk descriptor looks like this:
chunk_desc {
	u64 seq;
	u64 raid_chunk_index;
	u32 state;
	u8 bitmaps[];
}

seq: seq can be used to implement an LRU-like algorithm for chunk reclaim.
Every time data is written to the chunk, we update the chunk's seq. When we
flush a chunk from cache to raid, we freeze the chunk (i.e. the chunk can't
accept new IO). If there is new IO, we write it to another chunk. The new
chunk will have a bigger seq than the original chunk. After a crash and
reboot, the seq can be used to distinguish which chunk is newer.

raid_chunk_index: where the chunk should be flushed to raid

state: chunk state. Currently I define 3 states:
-FREE, the chunk is free
-RUNNING, the chunk maps to a raid chunk and accepts new IO
-PARITY_INCORE, the chunk has both data and parity stored in the cache

bitmaps: each page of data and parity has one bit; 1 means present. Data bits
are stored first.

-----IO READ PATH------
An IO read checks the chunk descriptors. If the data is present in the cache,
it is dispatched to the cache; otherwise to the raid.
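
A rough sketch of the dispatch (helper names are hypothetical, and bios that
span chunk boundaries are ignored here):

#include <linux/bio.h>

/* cache_find_chunk(), chunk_data_page() and chunk_page_present() are
 * hypothetical helpers over the in-memory descriptor index and the
 * per-chunk bitmap described above */
static void cache_handle_read(struct r5cache *cache, struct bio *bio)
{
	sector_t sect = bio->bi_iter.bi_sector;
	struct chunk_desc *desc = cache_find_chunk(cache, sect);

	if (desc && chunk_page_present(desc, chunk_data_page(desc, sect)))
		submit_to_cache(cache, bio);	/* data is in the cache chunk */
	else
		submit_to_raid(cache, bio);	/* fall back to the raid disks */
}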

-----IO WRITE PATH------
1. find or create a chunk in cache
2. write to cache
3. write descriptor

We write the descriptor immediately, in an asynchronous way, to reduce data
loss; the chunk will be in the RUNNING state.

-For a normal write, IO returns after 2. This cuts latency too. If there is a
crash, the chunk state might be FREE or the bitmap bit might not be set. In
either case this is the first write to the chunk, so an IO read will go to the
raid and get the old data; we meet the semantics. If the new data isn't in the
cache, we will read the old data from the cache, which meets the semantics too.

-For a FUA write, 2 will be a FUA write. When 2 finishes, run 3 with FUA. IO
returns after 3. A crash after IO returns doesn't affect the semantics. We
will read either old or new data if the crash happens before IO returns, which
is similar to the normal write case.

-For FLUSH, wait for all previous descriptor writes to finish and then flush
the cache disk's cache. In this way, we guarantee all previous writes hit the
cache.
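
A rough sketch of the write ordering (helper names are hypothetical):

#include <linux/bio.h>

/* Normal writes complete after step 2; FUA writes complete only after
 * the descriptor update (step 3) is durable. */
static void cache_handle_write(struct r5cache *cache, struct bio *bio)
{
	struct chunk_desc *desc = cache_get_chunk(cache, bio->bi_iter.bi_sector);
	bool fua = bio->bi_rw & REQ_FUA;

	/* step 2: write the data pages into the chunk, set bitmap bits */
	cache_write_data(cache, desc, bio, fua);

	if (fua) {
		/* step 3 with FUA; bio_endio() runs from its completion */
		cache_write_desc(cache, desc, REQ_FUA, bio);
	} else {
		/* step 3 is asynchronous; the write can complete now */
		cache_write_desc(cache, desc, 0, NULL);
		bio_endio(bio, 0);
	}
}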

-----chunk reclaim--------
1. select a chunk
2. freeze the chunk
3. copy the chunk data from cache to raid, so the stripe state machine runs,
e.g. calculates parity and so on
4. hook into raid5 run_io; we write the parity to the cache
5. flush the cache disk's cache
6. mark the descriptor PARITY_INCORE, and WRITE_FUA it to the cache
7. raid5 run_io continues; data and parity are written to the raid disks
8. flush all raid disk caches
9. mark the descriptor FREE, and WRITE_FUA it to the cache

We will batch several chunks per reclaim for better performance. The FUA
writes can be replaced with FLUSH too.

If there is a crash before 6, the descriptor state will be RUNNING; recovery
just needs to discard the parity bitmap. If there is a crash before 9, the
descriptor state will be PARITY_INCORE; recovery must copy both data and
parity to the raid.
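
Recovery over the descriptors would then look roughly like this (a sketch;
the iterator and helpers are hypothetical, and FREE/RUNNING/PARITY_INCORE
are assumed to be enum values):

static void cache_recover(struct r5cache *cache)
{
	struct chunk_desc *desc;

	for_each_chunk_desc(cache, desc) {
		switch (desc->state) {
		case FREE:
			break;
		case RUNNING:
			/* crash before 6: parity in the cache may be stale */
			chunk_clear_parity_bits(desc);
			break;
		case PARITY_INCORE:
			/* crash before 9: replay data and parity to the raid */
			chunk_copy_to_raid(cache, desc);
			flush_raid_disks(cache);	/* step 8 */
			chunk_mark_free(cache, desc);	/* step 9 */
			break;
		}
	}
}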

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-09  0:43         ` Shaohua Li
@ 2015-04-09  5:04           ` NeilBrown
  2015-04-09  6:15             ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2015-04-09  5:04 UTC (permalink / raw)
  To: Shaohua Li; +Cc: dan.j.williams, linux-raid, songliubraving, Kernel-team

[-- Attachment #1: Type: text/plain, Size: 6718 bytes --]

On Wed, 8 Apr 2015 17:43:11 -0700 Shaohua Li <shli@fb.com> wrote:

> Hi,
> This is what I'm working on now, and hopefully had the basic code
> running next week. The new design will do cache and fix the write hole
> issue too. Before I post the code out, I'd like to check if the design
> has obvious issues.

I can't say I'm excited about it....

You still haven't explained why you would ever want to read data from the
"cache".  Why not just keep everything in the stripe-cache until it is safe
in the RAID?   I asked before and you said:

>> I'm not enthusiastic to use stripe cache though, we can't keep all data
>> in stripe cache. What we really need is an index.

which is hardly an answer.  Why cannot you keep all the data in the stripe
cache?  How much data is there? How much memory can you afford to dedicate?

You must have some very long sustained bursts of writes which are much faster
than the RAID can accept in order to not be able to keep everything in memory.


Your cache layout seems very rigid.  I would much rather have a layout that
was very general and flexible.  If you want to always allocate a chunk at a
time then fine, but don't force that on the cache layout.

The log really should be very simple.  A block describing what comes next,
then lots of data/parity.  Then another block and more data etc etc.
Each metadata  block points to the next one.
If you need an index of the cache, you keep that in memory.  On restart, you
read all of the metadata blocks and build up the index.

I think that space in the log should be reclaimed in exactly the order that
it is written, so the active part of the log is contiguous.   Obviously
individual blocks become inactive in arbitrary order as they are written to
the RAID, but each extent of the log becomes free in order.
If you want that to happen out of order, you would need to present a very
good reason.

Best to start as simple as possible....


NeilBrown




> 
> Thanks,
> Shaohua
> 
> Main goal is to aggregate write IO to hopefully make full stripe IO and fix the
> write hole issue. This might speed up read too, but it's not optimized for
> read, eg, we don't proactivly cache data for read. The aggregation makes a lot
> of sense for workloads which sequentially write to several files. Such
> workloads are popular in today's datacenter.
> 
> Here cache = cache disk, generally SSD. raid = raid array or raid disks
> (excluding cache disk)
> -------------------------
> cache layout will like this:
> 
> |super|chunk descriptor|chunk data|
> 
> We divide cache to equal sized chunks. each chunk will have a descriptor.
> Its size will be raid_chunk_size * raid_disks. That is the cache chunk can
> store a whole raid chunk data and parity.
> 
> Write IO will store to cache chunks first and then flush to raid chunks. We use
> fixed size chunk:
> -manage cache space easily. We don't need a complex tree-like index
> 
> -flush data from cache to raid easily. data and parity are in the same chunk
> 
> -reclaim space is easy. when there is no free chunk in cache, we must try to
> free some chunks, eg, reclaim. We do reclaim in chunk unit. reclaim a chunk
> just means flush the chunk from cache to raid. If we use complex data
> structure, we will need garbage collection and so on.
> 
> -The downside is we waste space. Eg, a single 4k data will use a whole chunk in
> cache. But we can reclaim chunks with low utilization quickly to mitgate this
> issue partially.
> 
> --------------------
> chunk descriptor looks like this:
> chunk_desc {
> 	u64 seq;
> 	u64 raid_chunk_index;
> 	u32 state;
> 	u8 bitmaps[];
> }
> 
> seq: seq can be used to implement LRU-like algorithm for chunk reclaim. Every
> time data is written to the chunk, we update the chunk's seq. When we flush a
> chunk from cache to raid, we freeze the chunk (eg, the chunk can't accept new
> IO). If there is new IO, we write the new IO to another chunk. The new chunk
> will have a bigger seq than original chunk. crash and reboot can use the seq to
> detinguish which chunk is newer.
> 
> raid_chunk_index: where the chunk should be flushed to raid
> 
> state: chunk state. Currently I defined 3 states
> -FREE, the chunk is free
> -RUNNING, the chunk maps to raid chunk and accepts new IO
> -PARITY_INCORE, the chunk has both data and parity stored in cache
> 
> bitmaps: each page of data and parity has one bit. 1 means present. Store data
> bits first.
> 
> -----IO READ PATH------
> IO READ will check each chunk desc. If data is present in cache, dispatch to
> cache. otherwise to raid.
> 
> -----IO WRITE PATH------
> 1. find or create a chunk in cache
> 2. write to cache
> 3. write descriptor
> 
> We write descriptor immediately in asynchronous way to reduce data loss, the
> chunk will be RUNNING state.
> 
> -For normal write, IO return after 2. This will cut latency too. If there is a
> crash, the chunk state might be FREE or bitmap isn't set. In either case, this
> is the first write to the chunk, IO READ will read raid and get old data. We
> meet the symantics. If data isn't in cache, we will read old data in cache, we
> meet the symantics too.
> 
> -For FUA write, 2 will be a FUA write. When 2 finishes, run 3 with FUA. IO
> return after 3. Crash after IO return deosn't impact symantics. We will read
> old or new data if crash happens before IO return, which is the similar like
> the normal write case.
> 
> -For FLUSH, wait all previous descriptor write finish and then flush cache disk
> cache. In this way, we guarantee all previous write hit cache.
> 
> -----chunk reclaim--------
> 1. select a chunk
> 2. freeze the chunk
> 3. copy chunk data from cache to raid, so stripe state machine runs, eg,
> calculate parity and so on
> 4. Hook to raid5 run_io. We write parity to cache
> 5. flush cache disk cache
> 6. mark descriptor PARITY_INCORE, and WRITE_FUA to cache
> 7. raid5 run_io continue run. data and parity write to raid disks
> 8. flush all raid disk cache
> 9. mark descriptor FREE, WRITE_FUA to cache
> 
> We will batch several chunks for reclaim for better performance. FUA write can
> be replaced with FLUSH too.
> 
> If there is a crash before 6, descriptor state will be RUNNING. Recovery just
> need discard the parity bitmap. If there is a crash before 9, descriptor state
> will be PARITY_INCORE, recovery must copy both data and parity to raid.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-09  5:04           ` NeilBrown
@ 2015-04-09  6:15             ` Shaohua Li
  2015-04-09 15:37               ` Dan Williams
  0 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-09  6:15 UTC (permalink / raw)
  To: NeilBrown; +Cc: dan.j.williams, linux-raid, songliubraving, Kernel-team

On Thu, Apr 09, 2015 at 03:04:59PM +1000, NeilBrown wrote:
> On Wed, 8 Apr 2015 17:43:11 -0700 Shaohua Li <shli@fb.com> wrote:
> 
> > Hi,
> > This is what I'm working on now, and hopefully had the basic code
> > running next week. The new design will do cache and fix the write hole
> > issue too. Before I post the code out, I'd like to check if the design
> > has obvious issues.
> 
> I can't say I'm excited about it....
> 
> You still haven't explained why you would ever want to read data from the
> "cache"?  Why not just keep everything in the stripe-cache until it is safe
> in the RAID.   I asked before and you said:
> 
> >> I'm not enthusiastic to use stripe cache though, we can't keep all data
> >> in stripe cache. What we really need is an index.
> 
> which is hardly an answer.  Why cannot you keep all the data in the stripe
> cache?  How much data is there? How much memory can you afford to dedicate?
> 
> You must have some very long sustained bursts of writes which are much faster
> than the RAID can accept in order to not be able to keep everything in memory.
> 
> 
> Your cache layout seems very rigid.  I would much rather a layout that was
> very general and flexible.  If you want to always allocate a chunk at a time
> then fine, but don't force that on the cache layout.
> 
> The log really should be very simple.  A block describing what comes next,
> then lots of data/parity.  Then another block and more data etc etc.
> Each metadata  block points to the next one.
> If you need an index of the cache, you keep that in memory.  On restart, you
> read all of the metadata blocks and  built up the index.
> 
> I think that space in the log should be reclaimed in exactly the order that
> it is written, so the active part of the log is contiguous.   Obviously
> individual blocks become inactive in arbitrary order as they are written to
> the RAID, but each extent of the log becomes free in order.
> If you want that to happen out of order, you would need to present a very
> good reason.

I came to the same idea when I was thinking about a caching layer, but the
memory size is the main blocking issue. If the solution requires a large
amount of extra memory, it's not cost effective, and then it's a hard sell to
replace hardware raid with software raid. The design completely depends on
whether we can store all the data in memory. I don't have an answer yet for
how much memory we would need to make the aggregation efficient. I guess only
numbers can tell; I'll try to collect some data and get back to you.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-09  6:15             ` Shaohua Li
@ 2015-04-09 15:37               ` Dan Williams
  2015-04-09 16:03                 ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: Dan Williams @ 2015-04-09 15:37 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, Song Liu, Kernel-team

On Wed, Apr 8, 2015 at 11:15 PM, Shaohua Li <shli@fb.com> wrote:
> On Thu, Apr 09, 2015 at 03:04:59PM +1000, NeilBrown wrote:
>> On Wed, 8 Apr 2015 17:43:11 -0700 Shaohua Li <shli@fb.com> wrote:
>>
>> > Hi,
>> > This is what I'm working on now, and hopefully had the basic code
>> > running next week. The new design will do cache and fix the write hole
>> > issue too. Before I post the code out, I'd like to check if the design
>> > has obvious issues.
>>
>> I can't say I'm excited about it....
>>
>> You still haven't explained why you would ever want to read data from the
>> "cache"?  Why not just keep everything in the stripe-cache until it is safe
>> in the RAID.   I asked before and you said:
>>
>> >> I'm not enthusiastic to use stripe cache though, we can't keep all data
>> >> in stripe cache. What we really need is an index.
>>
>> which is hardly an answer.  Why cannot you keep all the data in the stripe
>> cache?  How much data is there? How much memory can you afford to dedicate?
>>
>> You must have some very long sustained bursts of writes which are much faster
>> than the RAID can accept in order to not be able to keep everything in memory.
>>
>>
>> Your cache layout seems very rigid.  I would much rather a layout that was
>> very general and flexible.  If you want to always allocate a chunk at a time
>> then fine, but don't force that on the cache layout.
>>
>> The log really should be very simple.  A block describing what comes next,
>> then lots of data/parity.  Then another block and more data etc etc.
>> Each metadata  block points to the next one.
>> If you need an index of the cache, you keep that in memory.  On restart, you
>> read all of the metadata blocks and  built up the index.
>>
>> I think that space in the log should be reclaimed in exactly the order that
>> it is written, so the active part of the log is contiguous.   Obviously
>> individual blocks become inactive in arbitrary order as they are written to
>> the RAID, but each extent of the log becomes free in order.
>> If you want that to happen out of order, you would need to present a very
>> good reason.
>
> I came to the same idea when I'm thinking about a caching layer, but the
> memory size is the main blocking issue. If the solution requires a large
> amount of extra memory, it's not cost effective, so a hard sell to
> replace hardware raid with software raid. The design completely depends
> on if we can store all data in memory. I don't have an anwser yet how
> much memory we should use to make the aggregation efficient. Guess only
> number can talk. I'll try to collect some data and get back to you.
>

Another consideration to keep in mind is persistent memory.  I'm
working on an in-kernel mechanism to claim and map pmem and a
raid-write-cache is an obvious first application.  I'll include you on
the initial submission of that capability.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
  2015-04-09 15:37               ` Dan Williams
@ 2015-04-09 16:03                 ` Shaohua Li
  0 siblings, 0 replies; 22+ messages in thread
From: Shaohua Li @ 2015-04-09 16:03 UTC (permalink / raw)
  To: Dan Williams; +Cc: NeilBrown, linux-raid, Song Liu, Kernel-team

On Thu, Apr 09, 2015 at 08:37:03AM -0700, Dan Williams wrote:
> On Wed, Apr 8, 2015 at 11:15 PM, Shaohua Li <shli@fb.com> wrote:
> > On Thu, Apr 09, 2015 at 03:04:59PM +1000, NeilBrown wrote:
> >> On Wed, 8 Apr 2015 17:43:11 -0700 Shaohua Li <shli@fb.com> wrote:
> >>
> >> > Hi,
> >> > This is what I'm working on now, and hopefully had the basic code
> >> > running next week. The new design will do cache and fix the write hole
> >> > issue too. Before I post the code out, I'd like to check if the design
> >> > has obvious issues.
> >>
> >> I can't say I'm excited about it....
> >>
> >> You still haven't explained why you would ever want to read data from the
> >> "cache"?  Why not just keep everything in the stripe-cache until it is safe
> >> in the RAID.   I asked before and you said:
> >>
> >> >> I'm not enthusiastic to use stripe cache though, we can't keep all data
> >> >> in stripe cache. What we really need is an index.
> >>
> >> which is hardly an answer.  Why cannot you keep all the data in the stripe
> >> cache?  How much data is there? How much memory can you afford to dedicate?
> >>
> >> You must have some very long sustained bursts of writes which are much faster
> >> than the RAID can accept in order to not be able to keep everything in memory.
> >>
> >>
> >> Your cache layout seems very rigid.  I would much rather a layout that was
> >> very general and flexible.  If you want to always allocate a chunk at a time
> >> then fine, but don't force that on the cache layout.
> >>
> >> The log really should be very simple.  A block describing what comes next,
> >> then lots of data/parity.  Then another block and more data etc etc.
> >> Each metadata  block points to the next one.
> >> If you need an index of the cache, you keep that in memory.  On restart, you
> >> read all of the metadata blocks and  built up the index.
> >>
> >> I think that space in the log should be reclaimed in exactly the order that
> >> it is written, so the active part of the log is contiguous.   Obviously
> >> individual blocks become inactive in arbitrary order as they are written to
> >> the RAID, but each extent of the log becomes free in order.
> >> If you want that to happen out of order, you would need to present a very
> >> good reason.
> >
> > I came to the same idea when I'm thinking about a caching layer, but the
> > memory size is the main blocking issue. If the solution requires a large
> > amount of extra memory, it's not cost effective, so a hard sell to
> > replace hardware raid with software raid. The design completely depends
> > on if we can store all data in memory. I don't have an anwser yet how
> > much memory we should use to make the aggregation efficient. Guess only
> > number can talk. I'll try to collect some data and get back to you.
> >
> 
> Another consideration to keep in mind is persistent memory.  I'm
> working on an in-kernel mechanism to claim and map pmem and a
> raid-write-cache is an obvious first application.  I'll include you on
> the initial submission of that capability.

Exactly; we are planning to use pmem in the future when it's mature and
popular. SSDs are still the best option until pmem is popular and widely
used.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-04-09 16:03 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-30 22:25 [RFC] raid5: add a log device to fix raid5/6 write hole issue Shaohua Li
2015-04-01  3:47 ` Dan Williams
2015-04-01  5:53   ` Shaohua Li
2015-04-01  6:02     ` NeilBrown
2015-04-01 17:14       ` Shaohua Li
2015-04-01 18:36   ` Piergiorgio Sartor
2015-04-01 18:46     ` Dan Williams
2015-04-01 20:07       ` Jiang, Dave
2015-04-01 18:46     ` Alireza Haghdoost
2015-04-01 19:57       ` Wols Lists
2015-04-01 20:04         ` Alireza Haghdoost
2015-04-01 20:18           ` Wols Lists
2015-04-01 20:17         ` Jens Axboe
2015-04-01 21:53 ` NeilBrown
2015-04-01 23:40   ` Shaohua Li
2015-04-02  0:19     ` NeilBrown
2015-04-02  4:07       ` Shaohua Li
2015-04-09  0:43         ` Shaohua Li
2015-04-09  5:04           ` NeilBrown
2015-04-09  6:15             ` Shaohua Li
2015-04-09 15:37               ` Dan Williams
2015-04-09 16:03                 ` Shaohua Li
