* [PATCH 0/3] Device mapper log writes patches
From: Josef Bacik @ 2015-03-19 20:31 UTC
To: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

Here are my patches for adding the dm-log-writes target to the kernel and the
supporting xfstests to go along with it. The dm patch has a pretty detailed
documentation file that describes the methodology behind the target and how to
use it. The generic xfstest has been tested on btrfs, xfs and ext4 and
currently fails on all of them in different ways with my kernel (which is just
3.19, so don't be too alarmed). The btrfs-specific one currently passes; more
evil tests will come later.

Basically the idea behind this target and these tests is to give our file
systems more thorough power-fail testing. The target logs writes in the order
in which they would have safely made it to disk, and then the tests replay
that log in various ways and check the result. You can find the supporting
userspace program here

https://github.com/josefbacik/log-writes

I apologize for it being ugly, I was trying to get it working as quickly as
possible. There is an example script in there that you can use to step
exhaustively through a log to make sure your file system is always consistent.

Thanks,

Josef
* [PATCH 1/3] dm: log writes target
From: Josef Bacik @ 2015-03-19 20:31 UTC
To: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

This creates a new target that is meant for file system developers to test
file system integrity at particular points in the life of a file system. We
capture all write requests and their data and log both to a separate device
for later replay. There is a userspace utility to do this replay. The idea
behind this is to give file system developers a way to verify that the file
system is always consistent. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 Documentation/device-mapper/dm-log-writes.txt | 136 +++++
 drivers/md/Kconfig                            |  16 +
 drivers/md/Makefile                           |   1 +
 drivers/md/dm-log-writes.c                    | 809 ++++++++++++++++++++++++++
 4 files changed, 962 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-log-writes.txt
 create mode 100644 drivers/md/dm-log-writes.c

diff --git a/Documentation/device-mapper/dm-log-writes.txt b/Documentation/device-mapper/dm-log-writes.txt
new file mode 100644
index 0000000..f3a9fa2
--- /dev/null
+++ b/Documentation/device-mapper/dm-log-writes.txt
@@ -0,0 +1,136 @@
+dm-log-writes
+=============
+
+This target takes 2 devices, one to pass all IO to normally, and one to log all
+of the write operations to. This is intended for file system developers wishing
+to verify the integrity of metadata or data as the file system is written to.
+There is a log_write_entry written for every WRITE request and the target is
+able to take arbitrary data from userspace to insert into the log. The data
+that is in the WRITE requests is copied into the log to make the replay happen
+exactly as it happened originally.
+
+Log Ordering
+============
+
+We log things in order of completion once we are sure the write is no longer in
+cache. This means that normal WRITE requests are not actually logged until the
+next REQ_FLUSH request. This is to make it easier for userspace to replay the
+log in a way that correlates to what is on disk and not what is in cache, to
+make it easier to detect improper waiting/flushing.
+
+This works by attaching all WRITE requests to a list once the write completes.
+Once we see a REQ_FLUSH request we splice this list onto the request and once
+the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
+completed WRITEs at the time of the issue of the REQ_FLUSH are added in order
+to simulate the worst case scenario with regard to power failures. Consider the
+following example (W means write, C means complete)
+
+W1,W2,W3,C3,C2,Wflush,C1,Cflush
+
+The log would show the following
+
+W3,W2,flush,W1....
+
+Again this is to simulate what is actually on disk, this allows us to detect
+cases where a power failure at a particular point in time would create an
+inconsistent file system.
+
+Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
+they complete as those requests will obviously bypass the device cache.
+
+Any REQ_DISCARD requests are treated like WRITE requests. This is because
+otherwise we would have all the DISCARD requests, and then the WRITE requests
+and then the FLUSH request. Consider the following example
+
+WRITE block 1, DISCARD block 1, FLUSH
+
+If we logged DISCARD when it completed, the replay would look like this
+
+DISCARD 1, WRITE 1, FLUSH
+
+which isn't quite what happened and wouldn't be caught during the log replay.
+
+Marks
+=====
+
+You can use dmsetup to set an arbitrary mark in a log. For example say you want
+to fsck a file system after every write, but first you need to replay up to the
+mkfs to make sure we're fsck'ing something reasonable, you would do something
+like this
+
+mkfs.btrfs -f /dev/mapper/log
+dmsetup message log 0 mark mkfs
+<run test>
+
+This would allow you to replay the log up to the mkfs mark and then replay from
+that point on doing the fsck check in the interval that you want.
+
+Every log has a mark at the end labeled "dm-log-writes-end".
+
+Userspace component
+===================
+
+There is a userspace tool that will replay the log for you in various ways.
+As of this writing the options are not well documented, they will be in the
+future. It can be found here
+
+https://github.com/josefbacik/log-writes
+
+Example usage
+=============
+
+Say you want to test fsync on your file system. You would do something like
+this
+
+TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
+dmsetup create log --table "$TABLE"
+mkfs.btrfs -f /dev/mapper/log
+dmsetup message log 0 mark mkfs
+
+mount /dev/mapper/log /mnt/btrfs-test
+<some test that does fsync at the end>
+dmsetup message log 0 mark fsync
+md5sum /mnt/btrfs-test/foo
+umount /mnt/btrfs-test
+
+dmsetup remove log
+replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
+mount /dev/sdb /mnt/btrfs-test
+md5sum /mnt/btrfs-test/foo
+<verify md5sums are correct>
+
+Another option is to do a complicated file system operation and verify the file
+system is consistent during the entire operation. You could do this by doing
+
+
+TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
+dmsetup create log --table "$TABLE"
+mkfs.btrfs -f /dev/mapper/log
+dmsetup message log 0 mark mkfs
+
+mount /dev/mapper/log /mnt/btrfs-test
+<fsstress to dirty the fs>
+btrfs filesystem balance /mnt/btrfs-test
+umount /mnt/btrfs-test
+dmsetup remove log
+
+replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
+btrfsck /dev/sdb
+replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
+	--fsck "btrfsck /dev/sdb" --check fua
+
+And that will replay the log until it sees a FUA request, run the fsck command
+and if the fsck passes it will replay to the next FUA, until it is completed or
+the fsck command exits abnormally.
+
+Table Parameters
+----------------
+	<dev path> <dev path for log>
+
+Mandatory parameters:
+	<dev path>: Full pathname to the underlying block-device, or a "major:minor"
+		device-number. This device is the one that all of the IO will go
+		to normally, just think of it as a normal linear mapping.
+	<dev path for log>: Same format as <dev path>, this is the device where the
+		log entries are written to.
+
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 63e05e3..f928ad5 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -432,4 +432,20 @@ config DM_SWITCH
 
 	  If unsure, say N.
 
+config DM_LOG_WRITES
+	tristate "Log writes target support"
+	depends on BLK_DEV_DM
+	---help---
+	  This device-mapper target takes two devices, one device to use
+	  normally, one to log all write operations done to the first device.
+	  This is for use by file system developers wishing to verify that
+	  their fs is writing a consistent file system at all times by allowing
+	  them to replay the log in a variety of ways and to check the
+	  contents.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called dm-log-writes.
+
+	  If unsure, say N.
+ endif # MD diff --git a/drivers/md/Makefile b/drivers/md/Makefile index a2da532..1863fea 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -55,6 +55,7 @@ obj-$(CONFIG_DM_CACHE) += dm-cache.o obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o obj-$(CONFIG_DM_ERA) += dm-era.o +obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o ifeq ($(CONFIG_DM_UEVENT),y) dm-mod-objs += dm-uevent.o diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c new file mode 100644 index 0000000..fddeb63 --- /dev/null +++ b/drivers/md/dm-log-writes.c @@ -0,0 +1,809 @@ +/* + * Copyright (C) 2014 Facebook. All rights reserved. + * + * This file is released under the GPL. + */ + +#include <linux/device-mapper.h> + +#include <linux/module.h> +#include <linux/init.h> +#include <linux/blkdev.h> +#include <linux/bio.h> +#include <linux/slab.h> +#include <linux/kthread.h> +#include <linux/freezer.h> + +#define DM_MSG_PREFIX "log-writes" + +/* + * This target will log sequentially all writes to the target device onto the + * log device. This is helpful for replaying writes to check for fs consitency + * at all times. This target provides a mechanism to mark specific events to + * check data at a later time. So for example you would + * + * write data + * fsync + * dmsetup message /dev/whatever mark mymark + * unmount /mnt/test + * + * Then replay the log up to mymark and check the contents of the replay to + * verify it matches what was written. + * + * We log writes only after they have been flushed, this makes the log describe + * close to the order in which the data hits the actual disk, not its cache. 
So + * for example the following sequence (W means write, C means complete) + * + * Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd + * + * Would result in the log looking like this + * + * c,a,flush,fuad,b,<other writes>,<next flush> + * + * This is meant to help expose problems where file systems do not properly wait + * on data being written before invoking a FLUSH. FUA bypasses cache so once it + * completes it is added to the log as it should be on disk. + * + * We treat DISCARDs as if they don't bypass cache so that they are logged in + * order of completion along with the normal writes. If we didn't do it this + * way we would process all the discards first and then write all the data, when + * in fact we want to do the data and the discard in the order that they + * completed. + */ +#define LOG_FLUSH_FLAG (1 << 0) +#define LOG_FUA_FLAG (1 << 1) +#define LOG_DISCARD_FLAG (1 << 2) +#define LOG_MARK_FLAG (1 << 3) + +#define WRITE_LOG_VERSION 1 +#define WRITE_LOG_MAGIC 0x6a736677736872 + +/* + * The disk format for this is braindead simple. + * + * At byte 0 we have our super, followed by the following sequence for + * nr_entries + * + * [ 1 sector ][ entry->nr_sectors ] + * [log_write_entry][ data written ] + * + * The log_write_entry takes up a full sector so we can have arbitrary length + * marks and it leaves us room for extra content in the future. + */ + +/* + * Basic info about the log for userspace. + */ +struct log_write_super { + __le64 magic; + __le64 version; + __le64 nr_entries; + __le32 sectorsize; +}; + +/* + * sector - the sector we wrote. + * nr_sectors - the number of sectors we wrote. + * flags - flags for this log entry. + * data_len - the size of the data in this log entry, this is for private log + * entry stuff, the MARK data provided by userspace for example. 
+ */ +struct log_write_entry { + __le64 sector; + __le64 nr_sectors; + __le64 flags; + __le64 data_len; +}; + +struct log_writes_c { + struct dm_dev *dev; + struct dm_dev *logdev; + u64 logged_entries; + u32 sectorsize; + atomic_t io_blocks; + atomic_t pending_blocks; + sector_t next_sector; + sector_t end_sector; + bool logging_enabled; + bool device_supports_discard; + spinlock_t blocks_lock; + struct list_head unflushed_blocks; + struct list_head logging_blocks; + wait_queue_head_t wait; + struct task_struct *log_kthread; +}; + +struct pending_block { + int vec_cnt; + u64 flags; + sector_t sector; + sector_t nr_sectors; + char *data; + u32 datalen; + struct list_head list; + struct bio_vec vecs[0]; +}; + +struct per_bio_data { + struct pending_block *block; +}; + +static void log_end_io(struct bio *bio, int err) +{ + struct log_writes_c *lc = bio->bi_private; + struct bio_vec *bvec; + int i; + + if (err) { + unsigned long flags; + + DMERR("Error writing log block %d", err); + spin_lock_irqsave(&lc->blocks_lock, flags); + lc->logging_enabled = false; + spin_unlock_irqrestore(&lc->blocks_lock, flags); + } + + bio_for_each_segment_all(bvec, bio, i) + __free_page(bvec->bv_page); + + atomic_dec(&lc->io_blocks); + wake_up(&lc->wait); + bio_put(bio); +} + +/* + * Meant to be called if there is an error, it will free all the pages + * associated with the block. 
+ */ +static void free_pending_block(struct log_writes_c *lc, + struct pending_block *block) +{ + int i; + + for (i = 0; i < block->vec_cnt; i++) { + if (block->vecs[i].bv_page) + __free_page(block->vecs[i].bv_page); + } + kfree(block->data); + kfree(block); + atomic_dec(&lc->pending_blocks); + wake_up(&lc->wait); +} + +static int write_metadata(struct log_writes_c *lc, void *entry, + size_t entrylen, void *data, size_t datalen, + sector_t sector) +{ + struct bio *bio; + struct page *page; + void *ptr; + size_t ret; + + bio = bio_alloc(GFP_KERNEL, 1); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + page = alloc_page(GFP_KERNEL); + if (!page) { + DMERR("Couldn't alloc log page"); + bio_put(bio); + goto error; + } + + ptr = kmap_atomic(page); + memset(ptr, 0, lc->sectorsize); + memcpy(ptr, entry, entrylen); + if (datalen) + memcpy(ptr + entrylen, data, datalen); + kunmap_atomic(ptr); + + ret = bio_add_page(bio, page, lc->sectorsize, 0); + if (ret != lc->sectorsize) { + DMERR("Couldn't add page to the log block"); + goto error_bio; + } + submit_bio(WRITE, bio); + return 0; +error_bio: + bio_put(bio); + __free_page(page); +error: + atomic_dec(&lc->io_blocks); + wake_up(&lc->wait); + return -1; +} + +static int log_one_block(struct log_writes_c *lc, + struct pending_block *block, sector_t sector) +{ + struct bio *bio; + struct log_write_entry entry; + size_t ret; + int i; + + entry.sector = cpu_to_le64(block->sector); + entry.nr_sectors = cpu_to_le64(block->nr_sectors); + entry.flags = cpu_to_le64(block->flags); + entry.data_len = block->datalen; + if (write_metadata(lc, &entry, sizeof(entry), block->data, + block->datalen, sector)) { + free_pending_block(lc, block); + return -1; + } + + if (!block->vec_cnt) + goto out; + sector++; + + bio = 
bio_alloc(GFP_KERNEL, block->vec_cnt); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + atomic_inc(&lc->io_blocks); + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + for (i = 0; i < block->vec_cnt; i++) { + ret = bio_add_page(bio, block->vecs[i].bv_page, + block->vecs[i].bv_len, 0); + if (ret != block->vecs[i].bv_len) { + atomic_inc(&lc->io_blocks); + submit_bio(WRITE, bio); + bio = bio_alloc(GFP_KERNEL, block->vec_cnt - i); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + ret = bio_add_page(bio, block->vecs[i].bv_page, + block->vecs[i].bv_len, 0); + if (ret != block->vecs[i].bv_len) { + DMERR("Seriously?"); + wake_up(&lc->wait); + bio_put(bio); + goto error; + } + } + sector += block->vecs[i].bv_len >> SECTOR_SHIFT; + } + submit_bio(WRITE, bio); +out: + kfree(block->data); + kfree(block); + atomic_dec(&lc->pending_blocks); + wake_up(&lc->wait); + return 0; +error: + free_pending_block(lc, block); + atomic_dec(&lc->io_blocks); + wake_up(&lc->wait); + return -1; +} + +static int log_super(struct log_writes_c *lc) +{ + struct log_write_super super; + + super.magic = cpu_to_le64(WRITE_LOG_MAGIC); + super.version = cpu_to_le64(WRITE_LOG_VERSION); + super.nr_entries = cpu_to_le64(lc->logged_entries); + super.sectorsize = cpu_to_le32(lc->sectorsize); + + if (write_metadata(lc, &super, sizeof(super), NULL, 0, 0)) { + DMERR("Couldn't write super"); + return -1; + } + + return 0; +} + +static inline sector_t logdev_last_sector(struct log_writes_c *lc) +{ + return i_size_read(lc->logdev->bdev->bd_inode) >> SECTOR_SHIFT; +} + +static int log_writes_kthread(void *arg) +{ + struct log_writes_c *lc = 
(struct log_writes_c *)arg; + sector_t sector = 0; + + while (!kthread_should_stop()) { + bool super = false; + bool logging_enabled; + struct pending_block *block = NULL; + int ret; + + spin_lock_irq(&lc->blocks_lock); + if (!list_empty(&lc->logging_blocks)) { + block = list_first_entry(&lc->logging_blocks, + struct pending_block, list); + list_del_init(&block->list); + if (!lc->logging_enabled) + goto next; + + sector = lc->next_sector; + if (block->flags & LOG_DISCARD_FLAG) + lc->next_sector++; + else + lc->next_sector += block->nr_sectors + 1; + + /* + * Apparently the size of the device may not be known + * right away, so handle this properly. + */ + if (!lc->end_sector) + lc->end_sector = logdev_last_sector(lc); + if (lc->end_sector && + lc->next_sector > lc->end_sector) { + DMERR("Ran out of space on the logdev"); + lc->logging_enabled = false; + goto next; + } + lc->logged_entries++; + atomic_inc(&lc->io_blocks); + + super = (block->flags & (LOG_FUA_FLAG | LOG_MARK_FLAG)); + if (super) + atomic_inc(&lc->io_blocks); + } +next: + logging_enabled = lc->logging_enabled; + spin_unlock_irq(&lc->blocks_lock); + if (block) { + if (logging_enabled) { + ret = log_one_block(lc, block, sector); + if (!ret && super) + ret = log_super(lc); + if (ret) { + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + } + } else + free_pending_block(lc, block); + continue; + } + + if (!try_to_freeze()) { + set_current_state(TASK_INTERRUPTIBLE); + if (!kthread_should_stop() && + !atomic_read(&lc->pending_blocks)) + schedule(); + __set_current_state(TASK_RUNNING); + } + } + return 0; +} + +/* + * Construct a log-writes mapping: + * log-writes <dev_path> <log_dev_path> + */ +static int log_writes_ctr(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct log_writes_c *lc; + struct dm_arg_set as; + const char *devname, *logdevname; + + as.argc = argc; + as.argv = argv; + + if (argc < 2) { + ti->error = "Invalid argument 
count"; + return -EINVAL; + } + + lc = kzalloc(sizeof(struct log_writes_c), GFP_KERNEL); + if (!lc) { + ti->error = "Cannot allocate context"; + return -ENOMEM; + } + spin_lock_init(&lc->blocks_lock); + INIT_LIST_HEAD(&lc->unflushed_blocks); + INIT_LIST_HEAD(&lc->logging_blocks); + init_waitqueue_head(&lc->wait); + lc->sectorsize = 1 << SECTOR_SHIFT; + atomic_set(&lc->io_blocks, 0); + atomic_set(&lc->pending_blocks, 0); + + devname = dm_shift_arg(&as); + if (dm_get_device(ti, devname, dm_table_get_mode(ti->table), &lc->dev)) { + ti->error = "Device lookup failed"; + goto bad; + } + + logdevname = dm_shift_arg(&as); + if (dm_get_device(ti, logdevname, dm_table_get_mode(ti->table), &lc->logdev)) { + ti->error = "Log device lookup failed"; + dm_put_device(ti, lc->dev); + goto bad; + } + + lc->log_kthread = kthread_run(log_writes_kthread, lc, "log-write"); + if (!lc->log_kthread) { + ti->error = "Couldn't alloc kthread"; + dm_put_device(ti, lc->dev); + dm_put_device(ti, lc->logdev); + goto bad; + } + + /* We put the super at sector 0, start logging at sector 1 */ + lc->next_sector = 1; + lc->logging_enabled = true; + lc->end_sector = logdev_last_sector(lc); + lc->device_supports_discard = true; + + ti->num_flush_bios = 1; + ti->flush_supported = true; + ti->num_discard_bios = 1; + ti->discards_supported = true; + ti->per_bio_data_size = sizeof(struct per_bio_data); + ti->private = lc; + return 0; + +bad: + kfree(lc); + return -EINVAL; +} + +static int log_mark(struct log_writes_c *lc, char *data) +{ + struct pending_block *block; + size_t maxsize = lc->sectorsize - sizeof(struct log_write_entry); + + block = kzalloc(sizeof(struct pending_block), GFP_KERNEL); + if (!block) { + DMERR("Error allocating pending block"); + return -ENOMEM; + } + + block->data = kstrndup(data, maxsize, GFP_KERNEL); + if (!block->data) { + DMERR("Error copying mark data"); + kfree(block); + return -ENOMEM; + } + atomic_inc(&lc->pending_blocks); + block->datalen = strlen(block->data); + 
block->flags |= LOG_MARK_FLAG; + spin_lock_irq(&lc->blocks_lock); + list_add_tail(&block->list, &lc->logging_blocks); + spin_unlock_irq(&lc->blocks_lock); + wake_up_process(lc->log_kthread); + return 0; +} + +static void log_writes_dtr(struct dm_target *ti) +{ + struct log_writes_c *lc = ti->private; + + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &lc->logging_blocks); + spin_unlock_irq(&lc->blocks_lock); + + /* + * This is just nice to have since it'll update the super to include the + * unflushed blocks, if it fails we don't really care. + */ + log_mark(lc, "dm-log-writes-end"); + wake_up_process(lc->log_kthread); + wait_event(lc->wait, !atomic_read(&lc->io_blocks) && + !atomic_read(&lc->pending_blocks)); + kthread_stop(lc->log_kthread); + + WARN_ON(!list_empty(&lc->logging_blocks)); + WARN_ON(!list_empty(&lc->unflushed_blocks)); + dm_put_device(ti, lc->dev); + dm_put_device(ti, lc->logdev); + kfree(lc); +} + +static void normal_map_bio(struct dm_target *ti, struct bio *bio) +{ + struct log_writes_c *lc = ti->private; + + bio->bi_bdev = lc->dev->bdev; +} + +static int log_writes_map(struct dm_target *ti, struct bio *bio) +{ + struct log_writes_c *lc = ti->private; + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); + struct pending_block *block; + struct bvec_iter iter; + struct bio_vec bv; + size_t alloc_size; + int i = 0; + bool flush_bio = (bio->bi_rw & REQ_FLUSH); + bool fua_bio = (bio->bi_rw & REQ_FUA); + bool discard_bio = (bio->bi_rw & REQ_DISCARD); + + pb->block = NULL; + + /* Don't bother doing anything if logging has been disabled */ + if (!lc->logging_enabled) + goto map_bio; + + /* + * Map reads as normal. + */ + if (bio_data_dir(bio) == READ) + goto map_bio; + + /* No sectors and not a flush? Don't care */ + if (!bio_sectors(bio) && !flush_bio) + goto map_bio; + + /* + * Discards will have bi_size set but there's no actual data, so just + * allocate the size of the pending block. 
+ */ + if (discard_bio) + alloc_size = sizeof(struct pending_block); + else + alloc_size = sizeof(struct pending_block) + sizeof(struct bio_vec) * bio_segments(bio); + + block = kzalloc(alloc_size, GFP_NOIO); + if (!block) { + DMERR("Error allocating pending block"); + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + return -ENOMEM; + } + INIT_LIST_HEAD(&block->list); + pb->block = block; + atomic_inc(&lc->pending_blocks); + + if (flush_bio) + block->flags |= LOG_FLUSH_FLAG; + if (fua_bio) + block->flags |= LOG_FUA_FLAG; + if (discard_bio) + block->flags |= LOG_DISCARD_FLAG; + + block->sector = bio->bi_iter.bi_sector; + block->nr_sectors = bio_sectors(bio); + + /* We don't need the data, just submit */ + if (discard_bio) { + WARN_ON(flush_bio || fua_bio); + if (lc->device_supports_discard) + goto map_bio; + bio_endio(bio, 0); + return DM_MAPIO_SUBMITTED; + } + + /* Flush bio, splice the unflushed blocks onto this list and submit */ + if (flush_bio && !bio_sectors(bio)) { + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &block->list); + spin_unlock_irq(&lc->blocks_lock); + goto map_bio; + } + + /* + * We will write this bio somewhere else way later so we need to copy + * the actual contents into new pages so we know the data will always be + * there. + * + * We do this because this could be a bio from O_DIRECT in which case we + * can't just hold onto the page until some later point, we have to + * manually copy the contents. 
+ */ + bio_for_each_segment(bv, bio, iter) { + struct page *page; + void *src, *dst; + + page = alloc_page(GFP_NOIO); + if (!page) { + DMERR("Error allocing page"); + free_pending_block(lc, block); + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + return -ENOMEM; + } + + src = kmap_atomic(bv.bv_page); + dst = kmap_atomic(page); + memcpy(dst, src + bv.bv_offset, bv.bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + block->vecs[i].bv_page = page; + block->vecs[i].bv_len = bv.bv_len; + block->vec_cnt++; + i++; + } + + /* Had a flush with data in it, weird */ + if (flush_bio) { + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &block->list); + spin_unlock_irq(&lc->blocks_lock); + } +map_bio: + normal_map_bio(ti, bio); + return DM_MAPIO_REMAPPED; +} + +static int normal_end_io(struct dm_target *ti, struct bio *bio, int error) +{ + struct log_writes_c *lc = ti->private; + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); + + if (bio_data_dir(bio) == WRITE && pb->block) { + struct pending_block *block = pb->block; + unsigned long flags; + + spin_lock_irqsave(&lc->blocks_lock, flags); + if (block->flags & LOG_FLUSH_FLAG) { + list_splice_tail_init(&block->list, &lc->logging_blocks); + list_add_tail(&block->list, &lc->logging_blocks); + wake_up_process(lc->log_kthread); + } else if (block->flags & LOG_FUA_FLAG) { + list_add_tail(&block->list, &lc->logging_blocks); + wake_up_process(lc->log_kthread); + } else + list_add_tail(&block->list, &lc->unflushed_blocks); + spin_unlock_irqrestore(&lc->blocks_lock, flags); + } + + return error; +} + +/* + * INFO format: <logged entries> <highest allocated sector> + */ +static void log_writes_status(struct dm_target *ti, status_type_t type, + unsigned status_flags, char *result, + unsigned maxlen) +{ + unsigned sz = 0; + struct log_writes_c *lc = ti->private; + + switch (type) { + case STATUSTYPE_INFO: + DMEMIT("%llu %llu", 
lc->logged_entries, + (unsigned long long)lc->next_sector - 1); + if (!lc->logging_enabled) + DMEMIT(" logging_disabled"); + break; + + case STATUSTYPE_TABLE: + DMEMIT("%s %s", lc->dev->name, lc->logdev->name); + break; + } +} + +static int log_writes_ioctl(struct dm_target *ti, unsigned int cmd, + unsigned long arg) +{ + struct log_writes_c *lc = ti->private; + struct dm_dev *dev = lc->dev; + int r = 0; + + /* + * Only pass ioctls through if the device sizes match exactly. + */ + if (ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT) + r = scsi_verify_blk_ioctl(NULL, cmd); + + return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg); +} + +static int log_writes_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct log_writes_c *lc = ti->private; + struct request_queue *q = bdev_get_queue(lc->dev->bdev); + + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = lc->dev->bdev; + bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector); + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + +static int log_writes_iterate_devices(struct dm_target *ti, + iterate_devices_callout_fn fn, + void *data) +{ + struct log_writes_c *lc = ti->private; + + return fn(ti, lc->dev, 0, ti->len, data); +} + +/* + * Messages supported: + * mark <mark data> - specify the marked data. 
+ */ +static int log_writes_message(struct dm_target *ti, unsigned argc, char **argv) +{ + int r = -EINVAL; + struct log_writes_c *lc = ti->private; + + if (argc != 2) { + DMWARN("Invalid log-writes message arguments, expect 2 arguments, got %d", argc); + return r; + } + + if (!strcasecmp(argv[0], "mark")) + r = log_mark(lc, argv[1]); + else + DMWARN("Unrecognised log writes target message received: %s", argv[0]); + + return r; +} + +static void log_writes_io_hints(struct dm_target *ti, struct queue_limits *limits) +{ + struct log_writes_c *lc = ti->private; + struct request_queue *q = bdev_get_queue(lc->dev->bdev); + + if (!q || !blk_queue_discard(q)) { + lc->device_supports_discard = false; + limits->discard_granularity = 1 << SECTOR_SHIFT; + limits->max_discard_sectors = (UINT_MAX >> SECTOR_SHIFT); + } +} + +static struct target_type log_writes_target = { + .name = "log-writes", + .version = {1, 0, 0}, + .module = THIS_MODULE, + .ctr = log_writes_ctr, + .dtr = log_writes_dtr, + .map = log_writes_map, + .end_io = normal_end_io, + .status = log_writes_status, + .ioctl = log_writes_ioctl, + .merge = log_writes_merge, + .message = log_writes_message, + .iterate_devices = log_writes_iterate_devices, + .io_hints = log_writes_io_hints, +}; + +static int __init dm_log_writes_init(void) +{ + int r = dm_register_target(&log_writes_target); + + if (r < 0) + DMERR("register failed %d", r); + + return r; +} + +static void __exit dm_log_writes_exit(void) +{ + dm_unregister_target(&log_writes_target); +} + +/* Module hooks */ +module_init(dm_log_writes_init); +module_exit(dm_log_writes_exit); + +MODULE_DESCRIPTION(DM_NAME " log writes target"); +MODULE_AUTHOR("Josef Bacik <jbacik@fb.com>"); +MODULE_LICENSE("GPL"); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
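The log ordering rule described in the documentation above (completed WRITEs are held back until a flush that was issued after they completed has itself completed, while FUA writes are logged on completion) can be sketched as a small userspace model. This is only an illustration under stated assumptions, not code from the patch: struct log_model and every function name below are hypothetical, and the model allows a single flush in flight at a time, whereas the target attaches a list to each flush request.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_IDS 16

/* Hypothetical model state; none of these names come from the patch. */
struct log_model {
	int unflushed[MAX_IDS];   /* completed writes not yet covered by a flush */
	int nr_unflushed;
	int flush_batch[MAX_IDS]; /* writes spliced onto the in-flight flush */
	int nr_flush_batch;
	char log[256];            /* resulting log order, comma separated */
};

void log_append(struct log_model *m, const char *name)
{
	size_t used = strlen(m->log);

	/* Room for a separator, the name and the trailing NUL. */
	assert(used + strlen(name) + 2 <= sizeof(m->log));
	if (used)
		strcat(m->log, ",");
	strcat(m->log, name);
}

/* A normal WRITE completed: park it until the next flush is issued. */
void write_completed(struct log_model *m, int id)
{
	assert(m->nr_unflushed < MAX_IDS);
	m->unflushed[m->nr_unflushed++] = id;
}

/* A flush was issued: it only covers writes that have already completed. */
void flush_issued(struct log_model *m)
{
	memcpy(m->flush_batch, m->unflushed,
	       sizeof(int) * (size_t)m->nr_unflushed);
	m->nr_flush_batch = m->nr_unflushed;
	m->nr_unflushed = 0;
}

/* The flush completed: log its batch in completion order, then the flush. */
void flush_completed(struct log_model *m)
{
	char name[16];
	int i;

	for (i = 0; i < m->nr_flush_batch; i++) {
		snprintf(name, sizeof(name), "W%d", m->flush_batch[i]);
		log_append(m, name);
	}
	m->nr_flush_batch = 0;
	log_append(m, "flush");
}

/* A FUA write completed: it bypasses the cache, so log it right away. */
void fua_completed(struct log_model *m, int id)
{
	char name[16];

	snprintf(name, sizeof(name), "FUA%d", id);
	log_append(m, name);
}

/* Replays W1,W2,W3,C3,C2,Wflush,C1,Cflush from the documentation. */
const char *run_doc_example(struct log_model *m)
{
	memset(m, 0, sizeof(*m));
	write_completed(m, 3);  /* C3 */
	write_completed(m, 2);  /* C2 */
	flush_issued(m);        /* Wflush */
	write_completed(m, 1);  /* C1: too late for this flush */
	flush_completed(m);     /* Cflush */
	return m->log;
}
```

Calling run_doc_example() on a struct log_model leaves "W3,W2,flush" in the log string with W1 still sitting in the unflushed list, which matches the W3,W2,flush,W1.... ordering in the documentation: W1 would only be logged once a later flush is issued and completes.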
* [PATCH 1/3] dm: log writes target
@ 2015-03-19 20:31 ` Josef Bacik
  0 siblings, 0 replies; 22+ messages in thread
From: Josef Bacik @ 2015-03-19 20:31 UTC
  To: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

This creates a new target that is meant for file system developers to test file
system integrity at particular points in the life of a file system.  We capture
all write requests and the data and log the requests and the data to a separate
device for later replay.  There is a userspace utility to do this replay.  The
idea behind this is to give file system developers a way to verify that the file
system is always consistent.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 Documentation/device-mapper/dm-log-writes.txt | 136 +++++
 drivers/md/Kconfig                            |  16 +
 drivers/md/Makefile                           |   1 +
 drivers/md/dm-log-writes.c                    | 809 ++++++++++++++++++++++++++
 4 files changed, 962 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-log-writes.txt
 create mode 100644 drivers/md/dm-log-writes.c

diff --git a/Documentation/device-mapper/dm-log-writes.txt b/Documentation/device-mapper/dm-log-writes.txt
new file mode 100644
index 0000000..f3a9fa2
--- /dev/null
+++ b/Documentation/device-mapper/dm-log-writes.txt
@@ -0,0 +1,136 @@
+dm-log-writes
+=============
+
+This target takes 2 devices, one to pass all IO to normally, and one to log all
+of the write operations to.  This is intended for file system developers wishing
+to verify the integrity of metadata or data as the file system is written to.
+There is a log_write_entry written for every WRITE request and the target is
+able to take arbitrary data from userspace to insert into the log.  The data
+that is in the WRITE requests is copied into the log to make the replay happen
+exactly as it happened originally.
+
+Log Ordering
+============
+
+We log things in order of completion once we are sure the write is no longer in
+cache.  This means that normal WRITE requests are not actually logged until the
+next REQ_FLUSH request.  This is to make it easier for userspace to replay the
+log in a way that correlates to what is on disk and not what is in cache, to
+make it easier to detect improper waiting/flushing.
+
+This works by attaching all WRITE requests to a list once the write completes.
+Once we see a REQ_FLUSH request we splice this list onto the request and once
+the FLUSH request completes we log all of the WRITEs and then the FLUSH.  Only
+completed WRITEs at the time of the issue of the REQ_FLUSH are added in order
+to simulate the worst case scenario with regard to power failures.  Consider the
+following example (W means write, C means complete)
+
+W1,W2,W3,C3,C2,Wflush,C1,Cflush
+
+The log would show the following
+
+W3,W2,flush,W1....
+
+Again this is to simulate what is actually on disk, this allows us to detect
+cases where a power failure at a particular point in time would create an
+inconsistent file system.
+
+Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
+they complete as those requests will obviously bypass the device cache.
+
+Any REQ_DISCARD requests are treated like WRITE requests.  This is because
+otherwise we would have all the DISCARD requests, and then the WRITE requests
+and then the FLUSH request.  Consider the following example
+
+WRITE block 1, DISCARD block 1, FLUSH
+
+If we logged DISCARD when it completed, the replay would look like this
+
+DISCARD 1, WRITE 1, FLUSH
+
+which isn't quite what happened and wouldn't be caught during the log replay.
+
+Marks
+=====
+
+You can use dmsetup to set an arbitrary mark in a log.  For example say you want
+to fsck a file system after every write, but first you need to replay up to the
+mkfs to make sure we're fsck'ing something reasonable, you would do something
+like this
+
+mkfs.btrfs -f /dev/mapper/log
+dmsetup message log 0 mark mkfs
+<run test>
+
+This would allow you to replay the log up to the mkfs mark and then replay from
+that point on doing the fsck check in the interval that you want.
+
+Every log has a mark at the end labeled "dm-log-writes-end".
+
+Userspace component
+===================
+
+There is a userspace tool that will replay the log for you in various ways.
+As of this writing the options are not well documented, they will be in the
+future.  It can be found here
+
+https://github.com/josefbacik/log-writes
+
+Example usage
+=============
+
+Say you want to test fsync on your file system.  You would do something like
+this
+
+TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
+dmsetup create log --table "$TABLE"
+mkfs.btrfs -f /dev/mapper/log
+dmsetup message log 0 mark mkfs
+
+mount /dev/mapper/log /mnt/btrfs-test
+<some test that does fsync at the end>
+dmsetup message log 0 mark fsync
+md5sum /mnt/btrfs-test/foo
+umount /mnt/btrfs-test
+
+dmsetup remove log
+replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
+mount /dev/sdb /mnt/btrfs-test
+md5sum /mnt/btrfs-test/foo
+<verify md5sum's are correct>
+
+Another option is to do a complicated file system operation and verify the file
+system is consistent during the entire operation.  You could do this by doing
+
+
+TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
+dmsetup create log --table "$TABLE"
+mkfs.btrfs -f /dev/mapper/log
+dmsetup message log 0 mark mkfs
+
+mount /dev/mapper/log /mnt/btrfs-test
+<fsstress to dirty the fs>
+btrfs filesystem balance /mnt/btrfs-test
+umount /mnt/btrfs-test
+dmsetup remove log
+
+replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
+btrfsck /dev/sdb
+replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
+	--fsck "btrfsck /dev/sdb" --check fua
+
+And that will replay the log until it sees a FUA request, run the fsck command
+and if the fsck passes it will replay to the next FUA, until it is completed or
+the fsck command exits abnormally.
+
+Table Parameters
+----------------
+    <dev path> <dev path for log>
+
+Mandatory parameters:
+    <dev path>: Full pathname to the underlying block-device, or a "major:minor"
+		device-number.  This device is the one that all of the IO will go
+		to normally, just think of it as a normal linear mapping.
+    <dev path for log>: Same format as <dev path>, this is the device where the
+			log entries are written to.
+
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 63e05e3..f928ad5 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -432,4 +432,20 @@ config DM_SWITCH
 
 	  If unsure, say N.
 
+config DM_LOG_WRITES
+	tristate "Log writes target support"
+	depends on BLK_DEV_DM
+	---help---
+	  This device-mapper target takes two devices, one device to use
+	  normally, one to log all write operations done to the first device.
+	  This is for use by file system developers wishing to verify that
+	  their fs is writing a consistent file system at all times by allowing
+	  them to replay the log in a variety of ways and to check the
+	  contents.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called dm-log-writes.
+
+	  If unsure, say N.
+ endif # MD diff --git a/drivers/md/Makefile b/drivers/md/Makefile index a2da532..1863fea 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -55,6 +55,7 @@ obj-$(CONFIG_DM_CACHE) += dm-cache.o obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o obj-$(CONFIG_DM_ERA) += dm-era.o +obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o ifeq ($(CONFIG_DM_UEVENT),y) dm-mod-objs += dm-uevent.o diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c new file mode 100644 index 0000000..fddeb63 --- /dev/null +++ b/drivers/md/dm-log-writes.c @@ -0,0 +1,809 @@ +/* + * Copyright (C) 2014 Facebook. All rights reserved. + * + * This file is released under the GPL. + */ + +#include <linux/device-mapper.h> + +#include <linux/module.h> +#include <linux/init.h> +#include <linux/blkdev.h> +#include <linux/bio.h> +#include <linux/slab.h> +#include <linux/kthread.h> +#include <linux/freezer.h> + +#define DM_MSG_PREFIX "log-writes" + +/* + * This target will log sequentially all writes to the target device onto the + * log device. This is helpful for replaying writes to check for fs consitency + * at all times. This target provides a mechanism to mark specific events to + * check data at a later time. So for example you would + * + * write data + * fsync + * dmsetup message /dev/whatever mark mymark + * unmount /mnt/test + * + * Then replay the log up to mymark and check the contents of the replay to + * verify it matches what was written. + * + * We log writes only after they have been flushed, this makes the log describe + * close to the order in which the data hits the actual disk, not its cache. 
So + * for example the following sequence (W means write, C means complete) + * + * Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd + * + * Would result in the log looking like this + * + * c,a,flush,fuad,b,<other writes>,<next flush> + * + * This is meant to help expose problems where file systems do not properly wait + * on data being written before invoking a FLUSH. FUA bypasses cache so once it + * completes it is added to the log as it should be on disk. + * + * We treat DISCARDs as if they don't bypass cache so that they are logged in + * order of completion along with the normal writes. If we didn't do it this + * way we would process all the discards first and then write all the data, when + * in fact we want to do the data and the discard in the order that they + * completed. + */ +#define LOG_FLUSH_FLAG (1 << 0) +#define LOG_FUA_FLAG (1 << 1) +#define LOG_DISCARD_FLAG (1 << 2) +#define LOG_MARK_FLAG (1 << 3) + +#define WRITE_LOG_VERSION 1 +#define WRITE_LOG_MAGIC 0x6a736677736872 + +/* + * The disk format for this is braindead simple. + * + * At byte 0 we have our super, followed by the following sequence for + * nr_entries + * + * [ 1 sector ][ entry->nr_sectors ] + * [log_write_entry][ data written ] + * + * The log_write_entry takes up a full sector so we can have arbitrary length + * marks and it leaves us room for extra content in the future. + */ + +/* + * Basic info about the log for userspace. + */ +struct log_write_super { + __le64 magic; + __le64 version; + __le64 nr_entries; + __le32 sectorsize; +}; + +/* + * sector - the sector we wrote. + * nr_sectors - the number of sectors we wrote. + * flags - flags for this log entry. + * data_len - the size of the data in this log entry, this is for private log + * entry stuff, the MARK data provided by userspace for example. 
+ */ +struct log_write_entry { + __le64 sector; + __le64 nr_sectors; + __le64 flags; + __le64 data_len; +}; + +struct log_writes_c { + struct dm_dev *dev; + struct dm_dev *logdev; + u64 logged_entries; + u32 sectorsize; + atomic_t io_blocks; + atomic_t pending_blocks; + sector_t next_sector; + sector_t end_sector; + bool logging_enabled; + bool device_supports_discard; + spinlock_t blocks_lock; + struct list_head unflushed_blocks; + struct list_head logging_blocks; + wait_queue_head_t wait; + struct task_struct *log_kthread; +}; + +struct pending_block { + int vec_cnt; + u64 flags; + sector_t sector; + sector_t nr_sectors; + char *data; + u32 datalen; + struct list_head list; + struct bio_vec vecs[0]; +}; + +struct per_bio_data { + struct pending_block *block; +}; + +static void log_end_io(struct bio *bio, int err) +{ + struct log_writes_c *lc = bio->bi_private; + struct bio_vec *bvec; + int i; + + if (err) { + unsigned long flags; + + DMERR("Error writing log block %d", err); + spin_lock_irqsave(&lc->blocks_lock, flags); + lc->logging_enabled = false; + spin_unlock_irqrestore(&lc->blocks_lock, flags); + } + + bio_for_each_segment_all(bvec, bio, i) + __free_page(bvec->bv_page); + + atomic_dec(&lc->io_blocks); + wake_up(&lc->wait); + bio_put(bio); +} + +/* + * Meant to be called if there is an error, it will free all the pages + * associated with the block. 
+ */ +static void free_pending_block(struct log_writes_c *lc, + struct pending_block *block) +{ + int i; + + for (i = 0; i < block->vec_cnt; i++) { + if (block->vecs[i].bv_page) + __free_page(block->vecs[i].bv_page); + } + kfree(block->data); + kfree(block); + atomic_dec(&lc->pending_blocks); + wake_up(&lc->wait); +} + +static int write_metadata(struct log_writes_c *lc, void *entry, + size_t entrylen, void *data, size_t datalen, + sector_t sector) +{ + struct bio *bio; + struct page *page; + void *ptr; + size_t ret; + + bio = bio_alloc(GFP_KERNEL, 1); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + page = alloc_page(GFP_KERNEL); + if (!page) { + DMERR("Couldn't alloc log page"); + bio_put(bio); + goto error; + } + + ptr = kmap_atomic(page); + memset(ptr, 0, lc->sectorsize); + memcpy(ptr, entry, entrylen); + if (datalen) + memcpy(ptr + entrylen, data, datalen); + kunmap_atomic(ptr); + + ret = bio_add_page(bio, page, lc->sectorsize, 0); + if (ret != lc->sectorsize) { + DMERR("Couldn't add page to the log block"); + goto error_bio; + } + submit_bio(WRITE, bio); + return 0; +error_bio: + bio_put(bio); + __free_page(page); +error: + atomic_dec(&lc->io_blocks); + wake_up(&lc->wait); + return -1; +} + +static int log_one_block(struct log_writes_c *lc, + struct pending_block *block, sector_t sector) +{ + struct bio *bio; + struct log_write_entry entry; + size_t ret; + int i; + + entry.sector = cpu_to_le64(block->sector); + entry.nr_sectors = cpu_to_le64(block->nr_sectors); + entry.flags = cpu_to_le64(block->flags); + entry.data_len = block->datalen; + if (write_metadata(lc, &entry, sizeof(entry), block->data, + block->datalen, sector)) { + free_pending_block(lc, block); + return -1; + } + + if (!block->vec_cnt) + goto out; + sector++; + + bio = 
bio_alloc(GFP_KERNEL, block->vec_cnt); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + atomic_inc(&lc->io_blocks); + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + for (i = 0; i < block->vec_cnt; i++) { + ret = bio_add_page(bio, block->vecs[i].bv_page, + block->vecs[i].bv_len, 0); + if (ret != block->vecs[i].bv_len) { + atomic_inc(&lc->io_blocks); + submit_bio(WRITE, bio); + bio = bio_alloc(GFP_KERNEL, block->vec_cnt - i); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + ret = bio_add_page(bio, block->vecs[i].bv_page, + block->vecs[i].bv_len, 0); + if (ret != block->vecs[i].bv_len) { + DMERR("Seriously?"); + wake_up(&lc->wait); + bio_put(bio); + goto error; + } + } + sector += block->vecs[i].bv_len >> SECTOR_SHIFT; + } + submit_bio(WRITE, bio); +out: + kfree(block->data); + kfree(block); + atomic_dec(&lc->pending_blocks); + wake_up(&lc->wait); + return 0; +error: + free_pending_block(lc, block); + atomic_dec(&lc->io_blocks); + wake_up(&lc->wait); + return -1; +} + +static int log_super(struct log_writes_c *lc) +{ + struct log_write_super super; + + super.magic = cpu_to_le64(WRITE_LOG_MAGIC); + super.version = cpu_to_le64(WRITE_LOG_VERSION); + super.nr_entries = cpu_to_le64(lc->logged_entries); + super.sectorsize = cpu_to_le32(lc->sectorsize); + + if (write_metadata(lc, &super, sizeof(super), NULL, 0, 0)) { + DMERR("Couldn't write super"); + return -1; + } + + return 0; +} + +static inline sector_t logdev_last_sector(struct log_writes_c *lc) +{ + return i_size_read(lc->logdev->bdev->bd_inode) >> SECTOR_SHIFT; +} + +static int log_writes_kthread(void *arg) +{ + struct log_writes_c *lc = 
(struct log_writes_c *)arg; + sector_t sector = 0; + + while (!kthread_should_stop()) { + bool super = false; + bool logging_enabled; + struct pending_block *block = NULL; + int ret; + + spin_lock_irq(&lc->blocks_lock); + if (!list_empty(&lc->logging_blocks)) { + block = list_first_entry(&lc->logging_blocks, + struct pending_block, list); + list_del_init(&block->list); + if (!lc->logging_enabled) + goto next; + + sector = lc->next_sector; + if (block->flags & LOG_DISCARD_FLAG) + lc->next_sector++; + else + lc->next_sector += block->nr_sectors + 1; + + /* + * Apparently the size of the device may not be known + * right away, so handle this properly. + */ + if (!lc->end_sector) + lc->end_sector = logdev_last_sector(lc); + if (lc->end_sector && + lc->next_sector > lc->end_sector) { + DMERR("Ran out of space on the logdev"); + lc->logging_enabled = false; + goto next; + } + lc->logged_entries++; + atomic_inc(&lc->io_blocks); + + super = (block->flags & (LOG_FUA_FLAG | LOG_MARK_FLAG)); + if (super) + atomic_inc(&lc->io_blocks); + } +next: + logging_enabled = lc->logging_enabled; + spin_unlock_irq(&lc->blocks_lock); + if (block) { + if (logging_enabled) { + ret = log_one_block(lc, block, sector); + if (!ret && super) + ret = log_super(lc); + if (ret) { + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + } + } else + free_pending_block(lc, block); + continue; + } + + if (!try_to_freeze()) { + set_current_state(TASK_INTERRUPTIBLE); + if (!kthread_should_stop() && + !atomic_read(&lc->pending_blocks)) + schedule(); + __set_current_state(TASK_RUNNING); + } + } + return 0; +} + +/* + * Construct a log-writes mapping: + * log-writes <dev_path> <log_dev_path> + */ +static int log_writes_ctr(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct log_writes_c *lc; + struct dm_arg_set as; + const char *devname, *logdevname; + + as.argc = argc; + as.argv = argv; + + if (argc < 2) { + ti->error = "Invalid argument 
count"; + return -EINVAL; + } + + lc = kzalloc(sizeof(struct log_writes_c), GFP_KERNEL); + if (!lc) { + ti->error = "Cannot allocate context"; + return -ENOMEM; + } + spin_lock_init(&lc->blocks_lock); + INIT_LIST_HEAD(&lc->unflushed_blocks); + INIT_LIST_HEAD(&lc->logging_blocks); + init_waitqueue_head(&lc->wait); + lc->sectorsize = 1 << SECTOR_SHIFT; + atomic_set(&lc->io_blocks, 0); + atomic_set(&lc->pending_blocks, 0); + + devname = dm_shift_arg(&as); + if (dm_get_device(ti, devname, dm_table_get_mode(ti->table), &lc->dev)) { + ti->error = "Device lookup failed"; + goto bad; + } + + logdevname = dm_shift_arg(&as); + if (dm_get_device(ti, logdevname, dm_table_get_mode(ti->table), &lc->logdev)) { + ti->error = "Log device lookup failed"; + dm_put_device(ti, lc->dev); + goto bad; + } + + lc->log_kthread = kthread_run(log_writes_kthread, lc, "log-write"); + if (!lc->log_kthread) { + ti->error = "Couldn't alloc kthread"; + dm_put_device(ti, lc->dev); + dm_put_device(ti, lc->logdev); + goto bad; + } + + /* We put the super at sector 0, start logging at sector 1 */ + lc->next_sector = 1; + lc->logging_enabled = true; + lc->end_sector = logdev_last_sector(lc); + lc->device_supports_discard = true; + + ti->num_flush_bios = 1; + ti->flush_supported = true; + ti->num_discard_bios = 1; + ti->discards_supported = true; + ti->per_bio_data_size = sizeof(struct per_bio_data); + ti->private = lc; + return 0; + +bad: + kfree(lc); + return -EINVAL; +} + +static int log_mark(struct log_writes_c *lc, char *data) +{ + struct pending_block *block; + size_t maxsize = lc->sectorsize - sizeof(struct log_write_entry); + + block = kzalloc(sizeof(struct pending_block), GFP_KERNEL); + if (!block) { + DMERR("Error allocating pending block"); + return -ENOMEM; + } + + block->data = kstrndup(data, maxsize, GFP_KERNEL); + if (!block->data) { + DMERR("Error copying mark data"); + kfree(block); + return -ENOMEM; + } + atomic_inc(&lc->pending_blocks); + block->datalen = strlen(block->data); + 
block->flags |= LOG_MARK_FLAG; + spin_lock_irq(&lc->blocks_lock); + list_add_tail(&block->list, &lc->logging_blocks); + spin_unlock_irq(&lc->blocks_lock); + wake_up_process(lc->log_kthread); + return 0; +} + +static void log_writes_dtr(struct dm_target *ti) +{ + struct log_writes_c *lc = ti->private; + + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &lc->logging_blocks); + spin_unlock_irq(&lc->blocks_lock); + + /* + * This is just nice to have since it'll update the super to include the + * unflushed blocks, if it fails we don't really care. + */ + log_mark(lc, "dm-log-writes-end"); + wake_up_process(lc->log_kthread); + wait_event(lc->wait, !atomic_read(&lc->io_blocks) && + !atomic_read(&lc->pending_blocks)); + kthread_stop(lc->log_kthread); + + WARN_ON(!list_empty(&lc->logging_blocks)); + WARN_ON(!list_empty(&lc->unflushed_blocks)); + dm_put_device(ti, lc->dev); + dm_put_device(ti, lc->logdev); + kfree(lc); +} + +static void normal_map_bio(struct dm_target *ti, struct bio *bio) +{ + struct log_writes_c *lc = ti->private; + + bio->bi_bdev = lc->dev->bdev; +} + +static int log_writes_map(struct dm_target *ti, struct bio *bio) +{ + struct log_writes_c *lc = ti->private; + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); + struct pending_block *block; + struct bvec_iter iter; + struct bio_vec bv; + size_t alloc_size; + int i = 0; + bool flush_bio = (bio->bi_rw & REQ_FLUSH); + bool fua_bio = (bio->bi_rw & REQ_FUA); + bool discard_bio = (bio->bi_rw & REQ_DISCARD); + + pb->block = NULL; + + /* Don't bother doing anything if logging has been disabled */ + if (!lc->logging_enabled) + goto map_bio; + + /* + * Map reads as normal. + */ + if (bio_data_dir(bio) == READ) + goto map_bio; + + /* No sectors and not a flush? Don't care */ + if (!bio_sectors(bio) && !flush_bio) + goto map_bio; + + /* + * Discards will have bi_size set but there's no actual data, so just + * allocate the size of the pending block. 
+ */ + if (discard_bio) + alloc_size = sizeof(struct pending_block); + else + alloc_size = sizeof(struct pending_block) + sizeof(struct bio_vec) * bio_segments(bio); + + block = kzalloc(alloc_size, GFP_NOIO); + if (!block) { + DMERR("Error allocating pending block"); + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + return -ENOMEM; + } + INIT_LIST_HEAD(&block->list); + pb->block = block; + atomic_inc(&lc->pending_blocks); + + if (flush_bio) + block->flags |= LOG_FLUSH_FLAG; + if (fua_bio) + block->flags |= LOG_FUA_FLAG; + if (discard_bio) + block->flags |= LOG_DISCARD_FLAG; + + block->sector = bio->bi_iter.bi_sector; + block->nr_sectors = bio_sectors(bio); + + /* We don't need the data, just submit */ + if (discard_bio) { + WARN_ON(flush_bio || fua_bio); + if (lc->device_supports_discard) + goto map_bio; + bio_endio(bio, 0); + return DM_MAPIO_SUBMITTED; + } + + /* Flush bio, splice the unflushed blocks onto this list and submit */ + if (flush_bio && !bio_sectors(bio)) { + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &block->list); + spin_unlock_irq(&lc->blocks_lock); + goto map_bio; + } + + /* + * We will write this bio somewhere else way later so we need to copy + * the actual contents into new pages so we know the data will always be + * there. + * + * We do this because this could be a bio from O_DIRECT in which case we + * can't just hold onto the page until some later point, we have to + * manually copy the contents. 
+ */ + bio_for_each_segment(bv, bio, iter) { + struct page *page; + void *src, *dst; + + page = alloc_page(GFP_NOIO); + if (!page) { + DMERR("Error allocing page"); + free_pending_block(lc, block); + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + return -ENOMEM; + } + + src = kmap_atomic(bv.bv_page); + dst = kmap_atomic(page); + memcpy(dst, src + bv.bv_offset, bv.bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + block->vecs[i].bv_page = page; + block->vecs[i].bv_len = bv.bv_len; + block->vec_cnt++; + i++; + } + + /* Had a flush with data in it, weird */ + if (flush_bio) { + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &block->list); + spin_unlock_irq(&lc->blocks_lock); + } +map_bio: + normal_map_bio(ti, bio); + return DM_MAPIO_REMAPPED; +} + +static int normal_end_io(struct dm_target *ti, struct bio *bio, int error) +{ + struct log_writes_c *lc = ti->private; + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); + + if (bio_data_dir(bio) == WRITE && pb->block) { + struct pending_block *block = pb->block; + unsigned long flags; + + spin_lock_irqsave(&lc->blocks_lock, flags); + if (block->flags & LOG_FLUSH_FLAG) { + list_splice_tail_init(&block->list, &lc->logging_blocks); + list_add_tail(&block->list, &lc->logging_blocks); + wake_up_process(lc->log_kthread); + } else if (block->flags & LOG_FUA_FLAG) { + list_add_tail(&block->list, &lc->logging_blocks); + wake_up_process(lc->log_kthread); + } else + list_add_tail(&block->list, &lc->unflushed_blocks); + spin_unlock_irqrestore(&lc->blocks_lock, flags); + } + + return error; +} + +/* + * INFO format: <logged entries> <highest allocated sector> + */ +static void log_writes_status(struct dm_target *ti, status_type_t type, + unsigned status_flags, char *result, + unsigned maxlen) +{ + unsigned sz = 0; + struct log_writes_c *lc = ti->private; + + switch (type) { + case STATUSTYPE_INFO: + DMEMIT("%llu %llu", 
lc->logged_entries, + (unsigned long long)lc->next_sector - 1); + if (!lc->logging_enabled) + DMEMIT(" logging_disabled"); + break; + + case STATUSTYPE_TABLE: + DMEMIT("%s %s", lc->dev->name, lc->logdev->name); + break; + } +} + +static int log_writes_ioctl(struct dm_target *ti, unsigned int cmd, + unsigned long arg) +{ + struct log_writes_c *lc = ti->private; + struct dm_dev *dev = lc->dev; + int r = 0; + + /* + * Only pass ioctls through if the device sizes match exactly. + */ + if (ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT) + r = scsi_verify_blk_ioctl(NULL, cmd); + + return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg); +} + +static int log_writes_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct log_writes_c *lc = ti->private; + struct request_queue *q = bdev_get_queue(lc->dev->bdev); + + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = lc->dev->bdev; + bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector); + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + +static int log_writes_iterate_devices(struct dm_target *ti, + iterate_devices_callout_fn fn, + void *data) +{ + struct log_writes_c *lc = ti->private; + + return fn(ti, lc->dev, 0, ti->len, data); +} + +/* + * Messages supported: + * mark <mark data> - specify the marked data. 
+ */ +static int log_writes_message(struct dm_target *ti, unsigned argc, char **argv) +{ + int r = -EINVAL; + struct log_writes_c *lc = ti->private; + + if (argc != 2) { + DMWARN("Invalid log-writes message arguments, expect 2 arguments, got %d", argc); + return r; + } + + if (!strcasecmp(argv[0], "mark")) + r = log_mark(lc, argv[1]); + else + DMWARN("Unrecognised log writes target message received: %s", argv[0]); + + return r; +} + +static void log_writes_io_hints(struct dm_target *ti, struct queue_limits *limits) +{ + struct log_writes_c *lc = ti->private; + struct request_queue *q = bdev_get_queue(lc->dev->bdev); + + if (!q || !blk_queue_discard(q)) { + lc->device_supports_discard = false; + limits->discard_granularity = 1 << SECTOR_SHIFT; + limits->max_discard_sectors = (UINT_MAX >> SECTOR_SHIFT); + } +} + +static struct target_type log_writes_target = { + .name = "log-writes", + .version = {1, 0, 0}, + .module = THIS_MODULE, + .ctr = log_writes_ctr, + .dtr = log_writes_dtr, + .map = log_writes_map, + .end_io = normal_end_io, + .status = log_writes_status, + .ioctl = log_writes_ioctl, + .merge = log_writes_merge, + .message = log_writes_message, + .iterate_devices = log_writes_iterate_devices, + .io_hints = log_writes_io_hints, +}; + +static int __init dm_log_writes_init(void) +{ + int r = dm_register_target(&log_writes_target); + + if (r < 0) + DMERR("register failed %d", r); + + return r; +} + +static void __exit dm_log_writes_exit(void) +{ + dm_unregister_target(&log_writes_target); +} + +/* Module hooks */ +module_init(dm_log_writes_init); +module_exit(dm_log_writes_exit); + +MODULE_DESCRIPTION(DM_NAME " log writes target"); +MODULE_AUTHOR("Josef Bacik <jbacik@fb.com>"); +MODULE_LICENSE("GPL"); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
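The ordering rule from the "Log Ordering" section of the documentation can be checked with a small userspace model. This is a sketch under stated assumptions, not code from the patch: sim_log_order(), struct sim_event and the SIM_* names are hypothetical, only one FLUSH is in flight at a time, and write submission events are left out because only completions and the FLUSH issue point decide the log order. It mirrors what normal_end_io() and log_writes_map() do with unflushed_blocks: completed writes wait on a list, a FLUSH takes the list when it is issued, and everything it took is logged ahead of the FLUSH when it completes.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SIM_MAX 32	/* arbitrary cap for this sketch */

enum sim_event_type {
	SIM_WRITE_DONE,		/* a normal WRITE completed */
	SIM_FUA_DONE,		/* a REQ_FUA write completed */
	SIM_FLUSH_ISSUED,	/* a REQ_FLUSH was submitted */
	SIM_FLUSH_DONE,		/* that REQ_FLUSH completed */
};

struct sim_event {
	enum sim_event_type type;
	int id;			/* write number, ignored for flush events */
};

/* Append "<prefix><id>" (or just "<prefix>" when id < 0) to the log string. */
static void sim_log(char *out, size_t outlen, const char *prefix, int id)
{
	size_t used = strlen(out);
	const char *sep = used ? "," : "";

	if (id < 0)
		snprintf(out + used, outlen - used, "%s%s", sep, prefix);
	else
		snprintf(out + used, outlen - used, "%s%s%d", sep, prefix, id);
}

/* Write the resulting log order into 'out' as a comma-separated string. */
void sim_log_order(const struct sim_event *ev, size_t nr,
		   char *out, size_t outlen)
{
	int unflushed[SIM_MAX], attached[SIM_MAX];
	size_t nr_unflushed = 0, nr_attached = 0, i, j;

	if (!outlen)
		return;
	out[0] = '\0';

	for (i = 0; i < nr; i++) {
		switch (ev[i].type) {
		case SIM_WRITE_DONE:
			/* Completed but possibly still in the device cache. */
			if (nr_unflushed < SIM_MAX)
				unflushed[nr_unflushed++] = ev[i].id;
			break;
		case SIM_FUA_DONE:
			/* FUA bypasses the cache, so log it right away. */
			sim_log(out, outlen, "FUA", ev[i].id);
			break;
		case SIM_FLUSH_ISSUED:
			/* Splice everything completed so far onto the flush. */
			for (j = 0; j < nr_unflushed && nr_attached < SIM_MAX; j++)
				attached[nr_attached++] = unflushed[j];
			nr_unflushed = 0;
			break;
		case SIM_FLUSH_DONE:
			/* Log the spliced writes first, then the flush itself. */
			for (j = 0; j < nr_attached; j++)
				sim_log(out, outlen, "W", attached[j]);
			nr_attached = 0;
			sim_log(out, outlen, "flush", -1);
			break;
		}
	}
}
```

Feeding it the documentation's example (C3, C2, flush issued, C1, flush completed) yields W3,W2,flush, and W1 only shows up once a later FLUSH completes.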
* Re: [PATCH 1/3] dm: log writes target
  2015-03-19 20:31 ` Josef Bacik
  (?)
@ 2015-03-19 23:16 ` Zach Brown
  2015-03-20 14:50   ` Josef Bacik
  -1 siblings, 1 reply; 22+ messages in thread
From: Zach Brown @ 2015-03-19 23:16 UTC
  To: Josef Bacik; +Cc: linux-btrfs, linux-fsdevel, dm-devel, fstests

On Thu, Mar 19, 2015 at 04:31:08PM -0400, Josef Bacik wrote:
> This creates a new target that is meant for file system developers to test file
> system integrity at particular points in the life of a file system.

Hi Josef, just a quick drive-by review for stuff that jumps out at me..

> +	atomic_dec(&lc->io_blocks);
> +	wake_up(&lc->wait);

This waitqueue is only used by the destructor? Seems worth putting this
off in a helper that tests the waitqueue so that it avoids taking locks
in the common case where nothing is waiting.

	atomic_dec(&lc->io_blocks);
	smp_mb__after_atomic(); /* see wake_up_bit() comment */
	if (waitqueue_active(&lc->wait))
		wake_up(&lc->wait);

> +	ptr = kmap_atomic(page);
> +	memset(ptr, 0, lc->sectorsize);
> +	memcpy(ptr, entry, entrylen);
> +	if (datalen)
> +		memcpy(ptr + entrylen, data, datalen);
> +	kunmap_atomic(ptr);

Drop the initial zeroing and only zero a remaining tail fragment?

	memset(ptr + entry + data, 0, sector - (entry + data))

> +	entry.sector = cpu_to_le64(block->sector);
> +	entry.nr_sectors = cpu_to_le64(block->nr_sectors);
> +	entry.flags = cpu_to_le64(block->flags);
> +	entry.data_len = block->datalen;

Missing cpu_to_le64? Build with sparse?

> +	for (i = 0; i < block->vec_cnt; i++) {
> +		ret = bio_add_page(bio, block->vecs[i].bv_page,
> +				   block->vecs[i].bv_len, 0);

It took me a minute to figure out that the offsets are always 0 because
each page starts with a copy of one bvec segment.

> +			sector = lc->next_sector;
> +			if (block->flags & LOG_DISCARD_FLAG)
> +				lc->next_sector++;
> +			else
> +				lc->next_sector += block->nr_sectors + 1;
> +
> +			/*
> +			 * Apparently the size of the device may not be known
> +			 * right away, so handle this properly.
> +			 */
> +			if (!lc->end_sector)
> +				lc->end_sector = logdev_last_sector(lc);
> +			if (lc->end_sector &&
> +			    lc->next_sector > lc->end_sector) {

Does that need to be >= to avoid trying to write to the sector at the
device's i_size?

- z
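The tail-zeroing suggestion in the review above is written in shorthand, so here is one way it could be spelled out. This is a userspace sketch under stated assumptions, not the code that went into V2: fill_log_sector() is a hypothetical name, the buffer stands in for the kmap'd page, and the bounds check is an addition for the sketch (the kernel caller limits the mark length in log_mark() instead).

```c
#include <stddef.h>
#include <string.h>

/*
 * Lay out one log sector: the entry header, then the optional payload
 * (for example the mark string), then zeroes for whatever is left.
 * Only the unused tail is cleared, instead of zeroing the whole sector
 * up front and overwriting most of it again.
 *
 * Returns 0 on success, -1 if header plus payload would not fit.
 */
int fill_log_sector(void *sector_buf, size_t sectorsize,
		    const void *entry, size_t entrylen,
		    const void *data, size_t datalen)
{
	unsigned char *ptr = sector_buf;
	size_t used;

	if (entrylen > sectorsize || datalen > sectorsize - entrylen)
		return -1;

	memcpy(ptr, entry, entrylen);
	if (datalen)
		memcpy(ptr + entrylen, data, datalen);

	/* Zero only the remaining tail fragment. */
	used = entrylen + datalen;
	memset(ptr + used, 0, sectorsize - used);
	return 0;
}
```

For a 32-byte entry and a short mark this clears roughly the same number of bytes as before, so the gain is small; the sketch is mainly meant to make the offsets in the one-line suggestion explicit.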
* [PATCH 1/3] dm: log writes target V2 2015-03-19 23:16 ` Zach Brown @ 2015-03-20 14:50 ` Josef Bacik 0 siblings, 0 replies; 22+ messages in thread From: Josef Bacik @ 2015-03-20 14:50 UTC (permalink / raw) To: linux-btrfs, linux-fsdevel, dm-devel, fstests, zab This creates a new target that is meant for file system developers to test file system integrity at particular points in the life of a file system. We capture all write requests and the data and log the requests and the data to a separate device for later replay. There is a userspace utility to do this replay. The idea behind this is to give file system developers to verify that the file system is always consistent. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> --- V1->V2: fixed up stuff based on Zachs review. Documentation/device-mapper/dm-log-writes.txt | 136 +++++ drivers/md/Kconfig | 16 + drivers/md/Makefile | 1 + drivers/md/dm-log-writes.c | 826 ++++++++++++++++++++++++++ 4 files changed, 979 insertions(+) create mode 100644 Documentation/device-mapper/dm-log-writes.txt create mode 100644 drivers/md/dm-log-writes.c diff --git a/Documentation/device-mapper/dm-log-writes.txt b/Documentation/device-mapper/dm-log-writes.txt new file mode 100644 index 0000000..f3a9fa2 --- /dev/null +++ b/Documentation/device-mapper/dm-log-writes.txt @@ -0,0 +1,136 @@ +dm-log-writes +============= + +This target takes 2 devices, one to pass all IO to normally, and one to log all +of the write operations to. This is intended for file system developers wishing +to verify the integrity of metadata or data as the file system is written to. +There is a log_writes_entry written for every WRITE request and the target is +able to take arbitrary data from userspace to insert into the log. The data +that is in the WRITE requests is copied into the log to make the replay happen +exactly as it happened originally. + +Log Ordering +============ + +We log things in order of completion once we are sure the write is no longer in +cache. 
This means that normal WRITE requests are not actually logged until the +next REQ_FLUSH request. This is to make it easier for userspace to replay the +log in a way that correlates to what is on disk and not what is in cache, to +make it easier to detect improper waiting/flushing. + +This works by attaching all WRITE requests to a list once the write completes. +Once we see a REQ_FLUSH request we splice this list onto the request and once +the FLUSH request completes we log all of the WRITE's and then the FLUSH. Only +completeled WRITEs at the time of the issue of the REQ_FLUSH are added in order +to simulate the worst case scenario with regard to power failures. Consider the +following example (W means write, C means complete) + +W1,W2,W3,C3,C2,Wflush,C1,Cflush + +The log would show the following + +W3,W2,flush,W1.... + +Again this is to simulate what is actually on disk, this allows us to detect +cases where a power failure at a particular point in time would create an +inconsistent file system. + +Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as +they complete as those requests will obviously bypass the device cache. + +Any REQ_DISCARD requests are treated like WRITE requests. This is because +otherwise we would have all the DISCARD requests, and then the WRITE requests +and then the FLUSH request. Consider the following example + +WRITE block 1, DISCARD block 1, FLUSH + +If we logged DISCARD when it completed, the replay would look like this + +DISCARD 1, WRITE 1, FLUSH + +which isn't quite what happened and wouldn't be caught during the log replay. + +Marks +===== + +You can use dmsetup to set an arbitrary mark in a log. 
For example say you want +to fsck a file system after every write, but first you need to replay up to the +mkfs to make sure we're fsck'ing something reasonable, you would do something +like this + +mkfs.btrfs -f /dev/mapper/log +dmsetup message log 0 mark mkfs +<run test> + +This would allow you to replay the log up to the mkfs mark and then replay from +that point on doing the fsck check in the interval that you want. + +Every log has a mark at the end labeled "dm-log-writes-end". + +Userspace component +=================== + +There is a userspace tool that will replay the log for you in various ways. +As of this writing the options are not well documented, they will be in the +future. It can be found here + +https://github.com/josefbacik/log-writes + +Example usage +============= + +Say you want to test fsync on your file system. You would do something like +this + +TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" +dmsetup create log --table "$TABLE" +mkfs.btrfs -f /dev/mapper/log +dmsetup message log 0 mark mkfs + +mount /dev/mapper/log /mnt/btrfs-test +<some test that does fsync at the end> +dmsetup message log 0 mark fsync +md5sum /mnt/btrfs-test/foo +umount /mnt/btrfs-test + +dmsetup remove log +replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync +mount /dev/sdb /mnt/btrfs-test +md5sum /mnt/btrfs-test/foo +<verify md5sums are correct> + +Another option is to do a complicated file system operation and verify the file +system is consistent during the entire operation. 
You could do this by doing + + +TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" +dmsetup create log --table "$TABLE" +mkfs.btrfs -f /dev/mapper/log +dmsetup message log 0 mark mkfs + +mount /dev/mapper/log /mnt/btrfs-test +<fsstress to dirty the fs> +btrfs filesystem balance /mnt/btrfs-test +umount /mnt/btrfs-test +dmsetup remove log + +replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs +btrfsck /dev/sdb +replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \ + --fsck "btrfsck /dev/sdb" --check fua + +And that will replay the log until it sees a FUA request, run the fsck command +and if the fsck passes it will replay to the next FUA, until it is completed or +the fsck command exits abnormally. + +Table Parameters +---------------- + <dev path> <dev path for log> + +Mandatory parameters: + <dev path>: Full pathname to the underlying block-device, or a "major:minor" + device-number. This device is the one that all of the IO will go + to normally, just think of it as a normal linear mapping. + <dev path for log>: Same format as <dev path>, this is the device where the + log entries are written to. + diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 63e05e3..f928ad5 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -432,4 +432,20 @@ config DM_SWITCH If unsure, say N. +config DM_LOG_WRITES + tristate "Log writes target support" + depends on BLK_DEV_DM + ---help--- + This device-mapper target takes two devices, one device to use + normally, one to log all write operations done to the first device. + This is for use by file system developers wishing to verify that + their fs is writing a consistent file system at all times by allowing + them to replay the log in a variety of ways and to check the + contents. + + To compile this code as a module, choose M here: the module will + be called dm-log-writes. + + If unsure, say N. 
+ endif # MD diff --git a/drivers/md/Makefile b/drivers/md/Makefile index a2da532..1863fea 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -55,6 +55,7 @@ obj-$(CONFIG_DM_CACHE) += dm-cache.o obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o obj-$(CONFIG_DM_ERA) += dm-era.o +obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o ifeq ($(CONFIG_DM_UEVENT),y) dm-mod-objs += dm-uevent.o diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c new file mode 100644 index 0000000..067cb07 --- /dev/null +++ b/drivers/md/dm-log-writes.c @@ -0,0 +1,826 @@ +/* + * Copyright (C) 2014 Facebook. All rights reserved. + * + * This file is released under the GPL. + */ + +#include <linux/device-mapper.h> + +#include <linux/module.h> +#include <linux/init.h> +#include <linux/blkdev.h> +#include <linux/bio.h> +#include <linux/slab.h> +#include <linux/kthread.h> +#include <linux/freezer.h> + +#define DM_MSG_PREFIX "log-writes" + +/* + * This target will log sequentially all writes to the target device onto the + * log device. This is helpful for replaying writes to check for fs consistency + * at all times. This target provides a mechanism to mark specific events to + * check data at a later time. So for example you would + * + * write data + * fsync + * dmsetup message /dev/whatever mark mymark + * umount /mnt/test + * + * Then replay the log up to mymark and check the contents of the replay to + * verify it matches what was written. + * + * We log writes only after they have been flushed, this makes the log describe + * close to the order in which the data hits the actual disk, not its cache. 
So + * for example the following sequence (W means write, C means complete) + * + * Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd + * + * Would result in the log looking like this + * + * c,a,flush,fuad,b,<other writes>,<next flush> + * + * This is meant to help expose problems where file systems do not properly wait + * on data being written before invoking a FLUSH. FUA bypasses cache so once it + * completes it is added to the log as it should be on disk. + * + * We treat DISCARDs as if they don't bypass cache so that they are logged in + * order of completion along with the normal writes. If we didn't do it this + * way we would process all the discards first and then write all the data, when + * in fact we want to do the data and the discard in the order that they + * completed. + */ +#define LOG_FLUSH_FLAG (1 << 0) +#define LOG_FUA_FLAG (1 << 1) +#define LOG_DISCARD_FLAG (1 << 2) +#define LOG_MARK_FLAG (1 << 3) + +#define WRITE_LOG_VERSION 1 +#define WRITE_LOG_MAGIC 0x6a736677736872 + +/* + * The disk format for this is braindead simple. + * + * At byte 0 we have our super, followed by the following sequence for + * nr_entries + * + * [ 1 sector ][ entry->nr_sectors ] + * [log_write_entry][ data written ] + * + * The log_write_entry takes up a full sector so we can have arbitrary length + * marks and it leaves us room for extra content in the future. + */ + +/* + * Basic info about the log for userspace. + */ +struct log_write_super { + __le64 magic; + __le64 version; + __le64 nr_entries; + __le32 sectorsize; +}; + +/* + * sector - the sector we wrote. + * nr_sectors - the number of sectors we wrote. + * flags - flags for this log entry. + * data_len - the size of the data in this log entry, this is for private log + * entry stuff, the MARK data provided by userspace for example. 
+ */ +struct log_write_entry { + __le64 sector; + __le64 nr_sectors; + __le64 flags; + __le64 data_len; +}; + +struct log_writes_c { + struct dm_dev *dev; + struct dm_dev *logdev; + u64 logged_entries; + u32 sectorsize; + atomic_t io_blocks; + atomic_t pending_blocks; + sector_t next_sector; + sector_t end_sector; + bool logging_enabled; + bool device_supports_discard; + spinlock_t blocks_lock; + struct list_head unflushed_blocks; + struct list_head logging_blocks; + wait_queue_head_t wait; + struct task_struct *log_kthread; +}; + +struct pending_block { + int vec_cnt; + u64 flags; + sector_t sector; + sector_t nr_sectors; + char *data; + u32 datalen; + struct list_head list; + struct bio_vec vecs[0]; +}; + +struct per_bio_data { + struct pending_block *block; +}; + +static void put_pending_block(struct log_writes_c *lc) +{ + if (atomic_dec_and_test(&lc->pending_blocks)) { + smp_mb__after_atomic(); + if (waitqueue_active(&lc->wait)) + wake_up(&lc->wait); + } +} + +static void put_io_block(struct log_writes_c *lc) +{ + if (atomic_dec_and_test(&lc->io_blocks)) { + smp_mb__after_atomic(); + if (waitqueue_active(&lc->wait)) + wake_up(&lc->wait); + } +} + +static void log_end_io(struct bio *bio, int err) +{ + struct log_writes_c *lc = bio->bi_private; + struct bio_vec *bvec; + int i; + + if (err) { + unsigned long flags; + + DMERR("Error writing log block %d", err); + spin_lock_irqsave(&lc->blocks_lock, flags); + lc->logging_enabled = false; + spin_unlock_irqrestore(&lc->blocks_lock, flags); + } + + bio_for_each_segment_all(bvec, bio, i) + __free_page(bvec->bv_page); + + put_io_block(lc); + bio_put(bio); +} + +/* + * Meant to be called if there is an error, it will free all the pages + * associated with the block. 
+ */ +static void free_pending_block(struct log_writes_c *lc, + struct pending_block *block) +{ + int i; + + for (i = 0; i < block->vec_cnt; i++) { + if (block->vecs[i].bv_page) + __free_page(block->vecs[i].bv_page); + } + kfree(block->data); + kfree(block); + put_pending_block(lc); +} + +static int write_metadata(struct log_writes_c *lc, void *entry, + size_t entrylen, void *data, size_t datalen, + sector_t sector) +{ + struct bio *bio; + struct page *page; + void *ptr; + size_t ret; + + bio = bio_alloc(GFP_KERNEL, 1); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + page = alloc_page(GFP_KERNEL); + if (!page) { + DMERR("Couldn't alloc log page"); + bio_put(bio); + goto error; + } + + ptr = kmap_atomic(page); + memcpy(ptr, entry, entrylen); + if (datalen) + memcpy(ptr + entrylen, data, datalen); + memset(ptr + entrylen + datalen, 0, + lc->sectorsize - entrylen - datalen); + kunmap_atomic(ptr); + + ret = bio_add_page(bio, page, lc->sectorsize, 0); + if (ret != lc->sectorsize) { + DMERR("Couldn't add page to the log block"); + goto error_bio; + } + submit_bio(WRITE, bio); + return 0; +error_bio: + bio_put(bio); + __free_page(page); +error: + put_io_block(lc); + return -1; +} + +static int log_one_block(struct log_writes_c *lc, + struct pending_block *block, sector_t sector) +{ + struct bio *bio; + struct log_write_entry entry; + size_t ret; + int i; + + entry.sector = cpu_to_le64(block->sector); + entry.nr_sectors = cpu_to_le64(block->nr_sectors); + entry.flags = cpu_to_le64(block->flags); + entry.data_len = cpu_to_le64(block->datalen); + if (write_metadata(lc, &entry, sizeof(entry), block->data, + block->datalen, sector)) { + free_pending_block(lc, block); + return -1; + } + + if (!block->vec_cnt) + goto out; + sector++; + + bio = bio_alloc(GFP_KERNEL, 
block->vec_cnt); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + atomic_inc(&lc->io_blocks); + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + for (i = 0; i < block->vec_cnt; i++) { + /* + * The page offset is always 0 because we allocate a new page + * for every bvec in the original bio for simplicity sake. + */ + ret = bio_add_page(bio, block->vecs[i].bv_page, + block->vecs[i].bv_len, 0); + if (ret != block->vecs[i].bv_len) { + atomic_inc(&lc->io_blocks); + submit_bio(WRITE, bio); + bio = bio_alloc(GFP_KERNEL, block->vec_cnt - i); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + ret = bio_add_page(bio, block->vecs[i].bv_page, + block->vecs[i].bv_len, 0); + if (ret != block->vecs[i].bv_len) { + DMERR("Couldn't add page on new bio?"); + bio_put(bio); + goto error; + } + } + sector += block->vecs[i].bv_len >> SECTOR_SHIFT; + } + submit_bio(WRITE, bio); +out: + kfree(block->data); + kfree(block); + put_pending_block(lc); + return 0; +error: + free_pending_block(lc, block); + put_io_block(lc); + return -1; +} + +static int log_super(struct log_writes_c *lc) +{ + struct log_write_super super; + + super.magic = cpu_to_le64(WRITE_LOG_MAGIC); + super.version = cpu_to_le64(WRITE_LOG_VERSION); + super.nr_entries = cpu_to_le64(lc->logged_entries); + super.sectorsize = cpu_to_le32(lc->sectorsize); + + if (write_metadata(lc, &super, sizeof(super), NULL, 0, 0)) { + DMERR("Couldn't write super"); + return -1; + } + + return 0; +} + +static inline sector_t logdev_last_sector(struct log_writes_c *lc) +{ + return i_size_read(lc->logdev->bdev->bd_inode) >> SECTOR_SHIFT; +} + +static int 
log_writes_kthread(void *arg) +{ + struct log_writes_c *lc = (struct log_writes_c *)arg; + sector_t sector = 0; + + while (!kthread_should_stop()) { + bool super = false; + bool logging_enabled; + struct pending_block *block = NULL; + int ret; + + spin_lock_irq(&lc->blocks_lock); + if (!list_empty(&lc->logging_blocks)) { + block = list_first_entry(&lc->logging_blocks, + struct pending_block, list); + list_del_init(&block->list); + if (!lc->logging_enabled) + goto next; + + sector = lc->next_sector; + if (block->flags & LOG_DISCARD_FLAG) + lc->next_sector++; + else + lc->next_sector += block->nr_sectors + 1; + + /* + * Apparently the size of the device may not be known + * right away, so handle this properly. + */ + if (!lc->end_sector) + lc->end_sector = logdev_last_sector(lc); + if (lc->end_sector && + lc->next_sector >= lc->end_sector) { + DMERR("Ran out of space on the logdev"); + lc->logging_enabled = false; + goto next; + } + lc->logged_entries++; + atomic_inc(&lc->io_blocks); + + super = (block->flags & (LOG_FUA_FLAG | LOG_MARK_FLAG)); + if (super) + atomic_inc(&lc->io_blocks); + } +next: + logging_enabled = lc->logging_enabled; + spin_unlock_irq(&lc->blocks_lock); + if (block) { + if (logging_enabled) { + ret = log_one_block(lc, block, sector); + if (!ret && super) + ret = log_super(lc); + if (ret) { + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + } + } else + free_pending_block(lc, block); + continue; + } + + if (!try_to_freeze()) { + set_current_state(TASK_INTERRUPTIBLE); + if (!kthread_should_stop() && + !atomic_read(&lc->pending_blocks)) + schedule(); + __set_current_state(TASK_RUNNING); + } + } + return 0; +} + +/* + * Construct a log-writes mapping: + * log-writes <dev_path> <log_dev_path> + */ +static int log_writes_ctr(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct log_writes_c *lc; + struct dm_arg_set as; + const char *devname, *logdevname; + + as.argc = argc; + as.argv = 
argv; + + if (argc < 2) { + ti->error = "Invalid argument count"; + return -EINVAL; + } + + lc = kzalloc(sizeof(struct log_writes_c), GFP_KERNEL); + if (!lc) { + ti->error = "Cannot allocate context"; + return -ENOMEM; + } + spin_lock_init(&lc->blocks_lock); + INIT_LIST_HEAD(&lc->unflushed_blocks); + INIT_LIST_HEAD(&lc->logging_blocks); + init_waitqueue_head(&lc->wait); + lc->sectorsize = 1 << SECTOR_SHIFT; + atomic_set(&lc->io_blocks, 0); + atomic_set(&lc->pending_blocks, 0); + + devname = dm_shift_arg(&as); + if (dm_get_device(ti, devname, dm_table_get_mode(ti->table), &lc->dev)) { + ti->error = "Device lookup failed"; + goto bad; + } + + logdevname = dm_shift_arg(&as); + if (dm_get_device(ti, logdevname, dm_table_get_mode(ti->table), &lc->logdev)) { + ti->error = "Log device lookup failed"; + dm_put_device(ti, lc->dev); + goto bad; + } + + lc->log_kthread = kthread_run(log_writes_kthread, lc, "log-write"); + /* kthread_run() returns an ERR_PTR() on failure, never NULL */ + if (IS_ERR(lc->log_kthread)) { + ti->error = "Couldn't alloc kthread"; + dm_put_device(ti, lc->dev); + dm_put_device(ti, lc->logdev); + goto bad; + } + + /* We put the super at sector 0, start logging at sector 1 */ + lc->next_sector = 1; + lc->logging_enabled = true; + lc->end_sector = logdev_last_sector(lc); + lc->device_supports_discard = true; + + ti->num_flush_bios = 1; + ti->flush_supported = true; + ti->num_discard_bios = 1; + ti->discards_supported = true; + ti->per_bio_data_size = sizeof(struct per_bio_data); + ti->private = lc; + return 0; + +bad: + kfree(lc); + return -EINVAL; +} + +static int log_mark(struct log_writes_c *lc, char *data) +{ + struct pending_block *block; + size_t maxsize = lc->sectorsize - sizeof(struct log_write_entry); + + block = kzalloc(sizeof(struct pending_block), GFP_KERNEL); + if (!block) { + DMERR("Error allocating pending block"); + return -ENOMEM; + } + + block->data = kstrndup(data, maxsize, GFP_KERNEL); + if (!block->data) { + DMERR("Error copying mark data"); + kfree(block); + return -ENOMEM; + } + 
atomic_inc(&lc->pending_blocks); + block->datalen = strlen(block->data); + block->flags |= LOG_MARK_FLAG; + spin_lock_irq(&lc->blocks_lock); + list_add_tail(&block->list, &lc->logging_blocks); + spin_unlock_irq(&lc->blocks_lock); + wake_up_process(lc->log_kthread); + return 0; +} + +static void log_writes_dtr(struct dm_target *ti) +{ + struct log_writes_c *lc = ti->private; + + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &lc->logging_blocks); + spin_unlock_irq(&lc->blocks_lock); + + /* + * This is just nice to have since it'll update the super to include the + * unflushed blocks, if it fails we don't really care. + */ + log_mark(lc, "dm-log-writes-end"); + wake_up_process(lc->log_kthread); + wait_event(lc->wait, !atomic_read(&lc->io_blocks) && + !atomic_read(&lc->pending_blocks)); + kthread_stop(lc->log_kthread); + + WARN_ON(!list_empty(&lc->logging_blocks)); + WARN_ON(!list_empty(&lc->unflushed_blocks)); + dm_put_device(ti, lc->dev); + dm_put_device(ti, lc->logdev); + kfree(lc); +} + +static void normal_map_bio(struct dm_target *ti, struct bio *bio) +{ + struct log_writes_c *lc = ti->private; + + bio->bi_bdev = lc->dev->bdev; +} + +static int log_writes_map(struct dm_target *ti, struct bio *bio) +{ + struct log_writes_c *lc = ti->private; + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); + struct pending_block *block; + struct bvec_iter iter; + struct bio_vec bv; + size_t alloc_size; + int i = 0; + bool flush_bio = (bio->bi_rw & REQ_FLUSH); + bool fua_bio = (bio->bi_rw & REQ_FUA); + bool discard_bio = (bio->bi_rw & REQ_DISCARD); + + pb->block = NULL; + + /* Don't bother doing anything if logging has been disabled */ + if (!lc->logging_enabled) + goto map_bio; + + /* + * Map reads as normal. + */ + if (bio_data_dir(bio) == READ) + goto map_bio; + + /* No sectors and not a flush? 
Don't care */ + if (!bio_sectors(bio) && !flush_bio) + goto map_bio; + + /* + * Discards will have bi_size set but there's no actual data, so just + * allocate the size of the pending block. + */ + if (discard_bio) + alloc_size = sizeof(struct pending_block); + else + alloc_size = sizeof(struct pending_block) + sizeof(struct bio_vec) * bio_segments(bio); + + block = kzalloc(alloc_size, GFP_NOIO); + if (!block) { + DMERR("Error allocating pending block"); + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + return -ENOMEM; + } + INIT_LIST_HEAD(&block->list); + pb->block = block; + atomic_inc(&lc->pending_blocks); + + if (flush_bio) + block->flags |= LOG_FLUSH_FLAG; + if (fua_bio) + block->flags |= LOG_FUA_FLAG; + if (discard_bio) + block->flags |= LOG_DISCARD_FLAG; + + block->sector = bio->bi_iter.bi_sector; + block->nr_sectors = bio_sectors(bio); + + /* We don't need the data, just submit */ + if (discard_bio) { + WARN_ON(flush_bio || fua_bio); + if (lc->device_supports_discard) + goto map_bio; + bio_endio(bio, 0); + return DM_MAPIO_SUBMITTED; + } + + /* Flush bio, splice the unflushed blocks onto this list and submit */ + if (flush_bio && !bio_sectors(bio)) { + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &block->list); + spin_unlock_irq(&lc->blocks_lock); + goto map_bio; + } + + /* + * We will write this bio somewhere else way later so we need to copy + * the actual contents into new pages so we know the data will always be + * there. + * + * We do this because this could be a bio from O_DIRECT in which case we + * can't just hold onto the page until some later point, we have to + * manually copy the contents. 
+ */ + bio_for_each_segment(bv, bio, iter) { + struct page *page; + void *src, *dst; + + page = alloc_page(GFP_NOIO); + if (!page) { + DMERR("Error allocing page"); + free_pending_block(lc, block); + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + return -ENOMEM; + } + + src = kmap_atomic(bv.bv_page); + dst = kmap_atomic(page); + memcpy(dst, src + bv.bv_offset, bv.bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + block->vecs[i].bv_page = page; + block->vecs[i].bv_len = bv.bv_len; + block->vec_cnt++; + i++; + } + + /* Had a flush with data in it, weird */ + if (flush_bio) { + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &block->list); + spin_unlock_irq(&lc->blocks_lock); + } +map_bio: + normal_map_bio(ti, bio); + return DM_MAPIO_REMAPPED; +} + +static int normal_end_io(struct dm_target *ti, struct bio *bio, int error) +{ + struct log_writes_c *lc = ti->private; + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); + + if (bio_data_dir(bio) == WRITE && pb->block) { + struct pending_block *block = pb->block; + unsigned long flags; + + spin_lock_irqsave(&lc->blocks_lock, flags); + if (block->flags & LOG_FLUSH_FLAG) { + list_splice_tail_init(&block->list, &lc->logging_blocks); + list_add_tail(&block->list, &lc->logging_blocks); + wake_up_process(lc->log_kthread); + } else if (block->flags & LOG_FUA_FLAG) { + list_add_tail(&block->list, &lc->logging_blocks); + wake_up_process(lc->log_kthread); + } else + list_add_tail(&block->list, &lc->unflushed_blocks); + spin_unlock_irqrestore(&lc->blocks_lock, flags); + } + + return error; +} + +/* + * INFO format: <logged entries> <highest allocated sector> + */ +static void log_writes_status(struct dm_target *ti, status_type_t type, + unsigned status_flags, char *result, + unsigned maxlen) +{ + unsigned sz = 0; + struct log_writes_c *lc = ti->private; + + switch (type) { + case STATUSTYPE_INFO: + DMEMIT("%llu %llu", 
lc->logged_entries, + (unsigned long long)lc->next_sector - 1); + if (!lc->logging_enabled) + DMEMIT(" logging_disabled"); + break; + + case STATUSTYPE_TABLE: + DMEMIT("%s %s", lc->dev->name, lc->logdev->name); + break; + } +} + +static int log_writes_ioctl(struct dm_target *ti, unsigned int cmd, + unsigned long arg) +{ + struct log_writes_c *lc = ti->private; + struct dm_dev *dev = lc->dev; + int r = 0; + + /* + * Only pass ioctls through if the device sizes match exactly. + */ + if (ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT) + r = scsi_verify_blk_ioctl(NULL, cmd); + + return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg); +} + +static int log_writes_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct log_writes_c *lc = ti->private; + struct request_queue *q = bdev_get_queue(lc->dev->bdev); + + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = lc->dev->bdev; + bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector); + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + +static int log_writes_iterate_devices(struct dm_target *ti, + iterate_devices_callout_fn fn, + void *data) +{ + struct log_writes_c *lc = ti->private; + + return fn(ti, lc->dev, 0, ti->len, data); +} + +/* + * Messages supported: + * mark <mark data> - specify the marked data. 
+ */ +static int log_writes_message(struct dm_target *ti, unsigned argc, char **argv) +{ + int r = -EINVAL; + struct log_writes_c *lc = ti->private; + + if (argc != 2) { + DMWARN("Invalid log-writes message arguments, expect 2 arguments, got %d", argc); + return r; + } + + if (!strcasecmp(argv[0], "mark")) + r = log_mark(lc, argv[1]); + else + DMWARN("Unrecognised log writes target message received: %s", argv[0]); + + return r; +} + +static void log_writes_io_hints(struct dm_target *ti, struct queue_limits *limits) +{ + struct log_writes_c *lc = ti->private; + struct request_queue *q = bdev_get_queue(lc->dev->bdev); + + if (!q || !blk_queue_discard(q)) { + lc->device_supports_discard = false; + limits->discard_granularity = 1 << SECTOR_SHIFT; + limits->max_discard_sectors = (UINT_MAX >> SECTOR_SHIFT); + } +} + +static struct target_type log_writes_target = { + .name = "log-writes", + .version = {1, 0, 0}, + .module = THIS_MODULE, + .ctr = log_writes_ctr, + .dtr = log_writes_dtr, + .map = log_writes_map, + .end_io = normal_end_io, + .status = log_writes_status, + .ioctl = log_writes_ioctl, + .merge = log_writes_merge, + .message = log_writes_message, + .iterate_devices = log_writes_iterate_devices, + .io_hints = log_writes_io_hints, +}; + +static int __init dm_log_writes_init(void) +{ + int r = dm_register_target(&log_writes_target); + + if (r < 0) + DMERR("register failed %d", r); + + return r; +} + +static void __exit dm_log_writes_exit(void) +{ + dm_unregister_target(&log_writes_target); +} + +/* Module hooks */ +module_init(dm_log_writes_init); +module_exit(dm_log_writes_exit); + +MODULE_DESCRIPTION(DM_NAME " log writes target"); +MODULE_AUTHOR("Josef Bacik <jbacik@fb.com>"); +MODULE_LICENSE("GPL"); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
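For anyone trying to follow the "Log Ordering" rules in the documentation above without reading the bio plumbing, here is a rough userspace sketch of the same bookkeeping. It is not part of the patch and not how the target is implemented; it only models the two lists (unflushed and "attached to the in-flight flush") with plain arrays. All of the sim_* names and SIM_MAX are made up for illustration, and it assumes at most one flush in flight at a time.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SIM_MAX 16	/* hypothetical cap on outstanding writes */

/* W = write submitted, C = write completed, as in the doc's notation. */
enum sim_kind { SIM_W, SIM_C, SIM_WFLUSH, SIM_CFLUSH, SIM_CFUA };

struct sim_event {
	enum sim_kind kind;
	int id;			/* ignored for flush events */
};

struct sim_state {
	int unflushed[SIM_MAX];	/* completed writes, not yet flushed */
	int nr_unflushed;
	int attached[SIM_MAX];	/* writes spliced onto the in-flight flush */
	int nr_attached;
	char log[256];		/* resulting log, e.g. "W3,W2,flush" */
};

/* Append one comma-separated entry to the textual log. */
static void sim_emit(struct sim_state *s, const char *prefix, int id)
{
	size_t len = strlen(s->log);
	const char *sep = len ? "," : "";

	if (id >= 0)
		snprintf(s->log + len, sizeof(s->log) - len, "%s%s%d",
			 sep, prefix, id);
	else
		snprintf(s->log + len, sizeof(s->log) - len, "%s%s",
			 sep, prefix);
}

static void sim_run(struct sim_state *s, const struct sim_event *ev, int nr)
{
	int i, j;

	memset(s, 0, sizeof(*s));
	for (i = 0; i < nr; i++) {
		switch (ev[i].kind) {
		case SIM_W:
			/* Submission alone never changes the log. */
			break;
		case SIM_C:
			/* A completed write may still be in the cache. */
			assert(s->nr_unflushed < SIM_MAX);
			s->unflushed[s->nr_unflushed++] = ev[i].id;
			break;
		case SIM_WFLUSH:
			/* Only writes completed by now ride on this flush. */
			for (j = 0; j < s->nr_unflushed; j++) {
				assert(s->nr_attached < SIM_MAX);
				s->attached[s->nr_attached++] = s->unflushed[j];
			}
			s->nr_unflushed = 0;
			break;
		case SIM_CFLUSH:
			/* Log the attached writes, then the flush itself. */
			for (j = 0; j < s->nr_attached; j++)
				sim_emit(s, "W", s->attached[j]);
			s->nr_attached = 0;
			sim_emit(s, "flush", -1);
			break;
		case SIM_CFUA:
			/* FUA bypasses the cache, log it on completion. */
			sim_emit(s, "FUA", ev[i].id);
			break;
		}
	}
}
```

Feeding it the sequence from the documentation (W1,W2,W3,C3,C2,Wflush,C1,Cflush) leaves "W3,W2,flush" in s.log and W1 still sitting on the unflushed list, which is the worst case the target is trying to capture: W1 completed but was never covered by a flush.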
+ */ +struct log_write_entry { + __le64 sector; + __le64 nr_sectors; + __le64 flags; + __le64 data_len; +}; + +struct log_writes_c { + struct dm_dev *dev; + struct dm_dev *logdev; + u64 logged_entries; + u32 sectorsize; + atomic_t io_blocks; + atomic_t pending_blocks; + sector_t next_sector; + sector_t end_sector; + bool logging_enabled; + bool device_supports_discard; + spinlock_t blocks_lock; + struct list_head unflushed_blocks; + struct list_head logging_blocks; + wait_queue_head_t wait; + struct task_struct *log_kthread; +}; + +struct pending_block { + int vec_cnt; + u64 flags; + sector_t sector; + sector_t nr_sectors; + char *data; + u32 datalen; + struct list_head list; + struct bio_vec vecs[0]; +}; + +struct per_bio_data { + struct pending_block *block; +}; + +static void put_pending_block(struct log_writes_c *lc) +{ + if (atomic_dec_and_test(&lc->pending_blocks)) { + smp_mb__after_atomic(); + if (waitqueue_active(&lc->wait)) + wake_up(&lc->wait); + } +} + +static void put_io_block(struct log_writes_c *lc) +{ + if (atomic_dec_and_test(&lc->io_blocks)) { + smp_mb__after_atomic(); + if (waitqueue_active(&lc->wait)) + wake_up(&lc->wait); + } +} + +static void log_end_io(struct bio *bio, int err) +{ + struct log_writes_c *lc = bio->bi_private; + struct bio_vec *bvec; + int i; + + if (err) { + unsigned long flags; + + DMERR("Error writing log block %d", err); + spin_lock_irqsave(&lc->blocks_lock, flags); + lc->logging_enabled = false; + spin_unlock_irqrestore(&lc->blocks_lock, flags); + } + + bio_for_each_segment_all(bvec, bio, i) + __free_page(bvec->bv_page); + + put_io_block(lc); + bio_put(bio); +} + +/* + * Meant to be called if there is an error, it will free all the pages + * associated with the block. 
+ */ +static void free_pending_block(struct log_writes_c *lc, + struct pending_block *block) +{ + int i; + + for (i = 0; i < block->vec_cnt; i++) { + if (block->vecs[i].bv_page) + __free_page(block->vecs[i].bv_page); + } + kfree(block->data); + kfree(block); + put_pending_block(lc); +} + +static int write_metadata(struct log_writes_c *lc, void *entry, + size_t entrylen, void *data, size_t datalen, + sector_t sector) +{ + struct bio *bio; + struct page *page; + void *ptr; + size_t ret; + + bio = bio_alloc(GFP_KERNEL, 1); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + page = alloc_page(GFP_KERNEL); + if (!page) { + DMERR("Couldn't alloc log page"); + bio_put(bio); + goto error; + } + + ptr = kmap_atomic(page); + memcpy(ptr, entry, entrylen); + if (datalen) + memcpy(ptr + entrylen, data, datalen); + memset(ptr + entrylen + datalen, 0, + lc->sectorsize - entrylen - datalen); + kunmap_atomic(ptr); + + ret = bio_add_page(bio, page, lc->sectorsize, 0); + if (ret != lc->sectorsize) { + DMERR("Couldn't add page to the log block"); + goto error_bio; + } + submit_bio(WRITE, bio); + return 0; +error_bio: + bio_put(bio); + __free_page(page); +error: + put_io_block(lc); + return -1; +} + +static int log_one_block(struct log_writes_c *lc, + struct pending_block *block, sector_t sector) +{ + struct bio *bio; + struct log_write_entry entry; + size_t ret; + int i; + + entry.sector = cpu_to_le64(block->sector); + entry.nr_sectors = cpu_to_le64(block->nr_sectors); + entry.flags = cpu_to_le64(block->flags); + entry.data_len = cpu_to_le64(block->datalen); + if (write_metadata(lc, &entry, sizeof(entry), block->data, + block->datalen, sector)) { + free_pending_block(lc, block); + return -1; + } + + if (!block->vec_cnt) + goto out; + sector++; + + bio = bio_alloc(GFP_KERNEL, 
block->vec_cnt); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + atomic_inc(&lc->io_blocks); + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + for (i = 0; i < block->vec_cnt; i++) { + /* + * The page offset is always 0 because we allocate a new page + * for every bvec in the original bio for simplicity sake. + */ + ret = bio_add_page(bio, block->vecs[i].bv_page, + block->vecs[i].bv_len, 0); + if (ret != block->vecs[i].bv_len) { + atomic_inc(&lc->io_blocks); + submit_bio(WRITE, bio); + bio = bio_alloc(GFP_KERNEL, block->vec_cnt - i); + if (!bio) { + DMERR("Couldn't alloc log bio"); + goto error; + } + bio->bi_iter.bi_size = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = lc->logdev->bdev; + bio->bi_end_io = log_end_io; + bio->bi_private = lc; + set_bit(BIO_UPTODATE, &bio->bi_flags); + + ret = bio_add_page(bio, block->vecs[i].bv_page, + block->vecs[i].bv_len, 0); + if (ret != block->vecs[i].bv_len) { + DMERR("Couldn't add page on new bio?"); + bio_put(bio); + goto error; + } + } + sector += block->vecs[i].bv_len >> SECTOR_SHIFT; + } + submit_bio(WRITE, bio); +out: + kfree(block->data); + kfree(block); + put_pending_block(lc); + return 0; +error: + free_pending_block(lc, block); + put_io_block(lc); + return -1; +} + +static int log_super(struct log_writes_c *lc) +{ + struct log_write_super super; + + super.magic = cpu_to_le64(WRITE_LOG_MAGIC); + super.version = cpu_to_le64(WRITE_LOG_VERSION); + super.nr_entries = cpu_to_le64(lc->logged_entries); + super.sectorsize = cpu_to_le32(lc->sectorsize); + + if (write_metadata(lc, &super, sizeof(super), NULL, 0, 0)) { + DMERR("Couldn't write super"); + return -1; + } + + return 0; +} + +static inline sector_t logdev_last_sector(struct log_writes_c *lc) +{ + return i_size_read(lc->logdev->bdev->bd_inode) >> SECTOR_SHIFT; +} + +static int 
log_writes_kthread(void *arg) +{ + struct log_writes_c *lc = (struct log_writes_c *)arg; + sector_t sector = 0; + + while (!kthread_should_stop()) { + bool super = false; + bool logging_enabled; + struct pending_block *block = NULL; + int ret; + + spin_lock_irq(&lc->blocks_lock); + if (!list_empty(&lc->logging_blocks)) { + block = list_first_entry(&lc->logging_blocks, + struct pending_block, list); + list_del_init(&block->list); + if (!lc->logging_enabled) + goto next; + + sector = lc->next_sector; + if (block->flags & LOG_DISCARD_FLAG) + lc->next_sector++; + else + lc->next_sector += block->nr_sectors + 1; + + /* + * Apparently the size of the device may not be known + * right away, so handle this properly. + */ + if (!lc->end_sector) + lc->end_sector = logdev_last_sector(lc); + if (lc->end_sector && + lc->next_sector >= lc->end_sector) { + DMERR("Ran out of space on the logdev"); + lc->logging_enabled = false; + goto next; + } + lc->logged_entries++; + atomic_inc(&lc->io_blocks); + + super = (block->flags & (LOG_FUA_FLAG | LOG_MARK_FLAG)); + if (super) + atomic_inc(&lc->io_blocks); + } +next: + logging_enabled = lc->logging_enabled; + spin_unlock_irq(&lc->blocks_lock); + if (block) { + if (logging_enabled) { + ret = log_one_block(lc, block, sector); + if (!ret && super) + ret = log_super(lc); + if (ret) { + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + } + } else + free_pending_block(lc, block); + continue; + } + + if (!try_to_freeze()) { + set_current_state(TASK_INTERRUPTIBLE); + if (!kthread_should_stop() && + !atomic_read(&lc->pending_blocks)) + schedule(); + __set_current_state(TASK_RUNNING); + } + } + return 0; +} + +/* + * Construct a log-writes mapping: + * log-writes <dev_path> <log_dev_path> + */ +static int log_writes_ctr(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct log_writes_c *lc; + struct dm_arg_set as; + const char *devname, *logdevname; + + as.argc = argc; + as.argv = 
argv; + + if (argc < 2) { + ti->error = "Invalid argument count"; + return -EINVAL; + } + + lc = kzalloc(sizeof(struct log_writes_c), GFP_KERNEL); + if (!lc) { + ti->error = "Cannot allocate context"; + return -ENOMEM; + } + spin_lock_init(&lc->blocks_lock); + INIT_LIST_HEAD(&lc->unflushed_blocks); + INIT_LIST_HEAD(&lc->logging_blocks); + init_waitqueue_head(&lc->wait); + lc->sectorsize = 1 << SECTOR_SHIFT; + atomic_set(&lc->io_blocks, 0); + atomic_set(&lc->pending_blocks, 0); + + devname = dm_shift_arg(&as); + if (dm_get_device(ti, devname, dm_table_get_mode(ti->table), &lc->dev)) { + ti->error = "Device lookup failed"; + goto bad; + } + + logdevname = dm_shift_arg(&as); + if (dm_get_device(ti, logdevname, dm_table_get_mode(ti->table), &lc->logdev)) { + ti->error = "Log device lookup failed"; + dm_put_device(ti, lc->dev); + goto bad; + } + + lc->log_kthread = kthread_run(log_writes_kthread, lc, "log-write"); + if (!lc->log_kthread) { + ti->error = "Couldn't alloc kthread"; + dm_put_device(ti, lc->dev); + dm_put_device(ti, lc->logdev); + goto bad; + } + + /* We put the super at sector 0, start logging at sector 1 */ + lc->next_sector = 1; + lc->logging_enabled = true; + lc->end_sector = logdev_last_sector(lc); + lc->device_supports_discard = true; + + ti->num_flush_bios = 1; + ti->flush_supported = true; + ti->num_discard_bios = 1; + ti->discards_supported = true; + ti->per_bio_data_size = sizeof(struct per_bio_data); + ti->private = lc; + return 0; + +bad: + kfree(lc); + return -EINVAL; +} + +static int log_mark(struct log_writes_c *lc, char *data) +{ + struct pending_block *block; + size_t maxsize = lc->sectorsize - sizeof(struct log_write_entry); + + block = kzalloc(sizeof(struct pending_block), GFP_KERNEL); + if (!block) { + DMERR("Error allocating pending block"); + return -ENOMEM; + } + + block->data = kstrndup(data, maxsize, GFP_KERNEL); + if (!block->data) { + DMERR("Error copying mark data"); + kfree(block); + return -ENOMEM; + } + 
atomic_inc(&lc->pending_blocks); + block->datalen = strlen(block->data); + block->flags |= LOG_MARK_FLAG; + spin_lock_irq(&lc->blocks_lock); + list_add_tail(&block->list, &lc->logging_blocks); + spin_unlock_irq(&lc->blocks_lock); + wake_up_process(lc->log_kthread); + return 0; +} + +static void log_writes_dtr(struct dm_target *ti) +{ + struct log_writes_c *lc = ti->private; + + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &lc->logging_blocks); + spin_unlock_irq(&lc->blocks_lock); + + /* + * This is just nice to have since it'll update the super to include the + * unflushed blocks, if it fails we don't really care. + */ + log_mark(lc, "dm-log-writes-end"); + wake_up_process(lc->log_kthread); + wait_event(lc->wait, !atomic_read(&lc->io_blocks) && + !atomic_read(&lc->pending_blocks)); + kthread_stop(lc->log_kthread); + + WARN_ON(!list_empty(&lc->logging_blocks)); + WARN_ON(!list_empty(&lc->unflushed_blocks)); + dm_put_device(ti, lc->dev); + dm_put_device(ti, lc->logdev); + kfree(lc); +} + +static void normal_map_bio(struct dm_target *ti, struct bio *bio) +{ + struct log_writes_c *lc = ti->private; + + bio->bi_bdev = lc->dev->bdev; +} + +static int log_writes_map(struct dm_target *ti, struct bio *bio) +{ + struct log_writes_c *lc = ti->private; + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); + struct pending_block *block; + struct bvec_iter iter; + struct bio_vec bv; + size_t alloc_size; + int i = 0; + bool flush_bio = (bio->bi_rw & REQ_FLUSH); + bool fua_bio = (bio->bi_rw & REQ_FUA); + bool discard_bio = (bio->bi_rw & REQ_DISCARD); + + pb->block = NULL; + + /* Don't bother doing anything if logging has been disabled */ + if (!lc->logging_enabled) + goto map_bio; + + /* + * Map reads as normal. + */ + if (bio_data_dir(bio) == READ) + goto map_bio; + + /* No sectors and not a flush? 
Don't care */ + if (!bio_sectors(bio) && !flush_bio) + goto map_bio; + + /* + * Discards will have bi_size set but there's no actual data, so just + * allocate the size of the pending block. + */ + if (discard_bio) + alloc_size = sizeof(struct pending_block); + else + alloc_size = sizeof(struct pending_block) + sizeof(struct bio_vec) * bio_segments(bio); + + block = kzalloc(alloc_size, GFP_NOIO); + if (!block) { + DMERR("Error allocating pending block"); + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + return -ENOMEM; + } + INIT_LIST_HEAD(&block->list); + pb->block = block; + atomic_inc(&lc->pending_blocks); + + if (flush_bio) + block->flags |= LOG_FLUSH_FLAG; + if (fua_bio) + block->flags |= LOG_FUA_FLAG; + if (discard_bio) + block->flags |= LOG_DISCARD_FLAG; + + block->sector = bio->bi_iter.bi_sector; + block->nr_sectors = bio_sectors(bio); + + /* We don't need the data, just submit */ + if (discard_bio) { + WARN_ON(flush_bio || fua_bio); + if (lc->device_supports_discard) + goto map_bio; + bio_endio(bio, 0); + return DM_MAPIO_SUBMITTED; + } + + /* Flush bio, splice the unflushed blocks onto this list and submit */ + if (flush_bio && !bio_sectors(bio)) { + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &block->list); + spin_unlock_irq(&lc->blocks_lock); + goto map_bio; + } + + /* + * We will write this bio somewhere else way later so we need to copy + * the actual contents into new pages so we know the data will always be + * there. + * + * We do this because this could be a bio from O_DIRECT in which case we + * can't just hold onto the page until some later point, we have to + * manually copy the contents. 
+ */ + bio_for_each_segment(bv, bio, iter) { + struct page *page; + void *src, *dst; + + page = alloc_page(GFP_NOIO); + if (!page) { + DMERR("Error allocing page"); + free_pending_block(lc, block); + spin_lock_irq(&lc->blocks_lock); + lc->logging_enabled = false; + spin_unlock_irq(&lc->blocks_lock); + return -ENOMEM; + } + + src = kmap_atomic(bv.bv_page); + dst = kmap_atomic(page); + memcpy(dst, src + bv.bv_offset, bv.bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + block->vecs[i].bv_page = page; + block->vecs[i].bv_len = bv.bv_len; + block->vec_cnt++; + i++; + } + + /* Had a flush with data in it, weird */ + if (flush_bio) { + spin_lock_irq(&lc->blocks_lock); + list_splice_init(&lc->unflushed_blocks, &block->list); + spin_unlock_irq(&lc->blocks_lock); + } +map_bio: + normal_map_bio(ti, bio); + return DM_MAPIO_REMAPPED; +} + +static int normal_end_io(struct dm_target *ti, struct bio *bio, int error) +{ + struct log_writes_c *lc = ti->private; + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); + + if (bio_data_dir(bio) == WRITE && pb->block) { + struct pending_block *block = pb->block; + unsigned long flags; + + spin_lock_irqsave(&lc->blocks_lock, flags); + if (block->flags & LOG_FLUSH_FLAG) { + list_splice_tail_init(&block->list, &lc->logging_blocks); + list_add_tail(&block->list, &lc->logging_blocks); + wake_up_process(lc->log_kthread); + } else if (block->flags & LOG_FUA_FLAG) { + list_add_tail(&block->list, &lc->logging_blocks); + wake_up_process(lc->log_kthread); + } else + list_add_tail(&block->list, &lc->unflushed_blocks); + spin_unlock_irqrestore(&lc->blocks_lock, flags); + } + + return error; +} + +/* + * INFO format: <logged entries> <highest allocated sector> + */ +static void log_writes_status(struct dm_target *ti, status_type_t type, + unsigned status_flags, char *result, + unsigned maxlen) +{ + unsigned sz = 0; + struct log_writes_c *lc = ti->private; + + switch (type) { + case STATUSTYPE_INFO: + DMEMIT("%llu %llu", 
lc->logged_entries, + (unsigned long long)lc->next_sector - 1); + if (!lc->logging_enabled) + DMEMIT(" logging_disabled"); + break; + + case STATUSTYPE_TABLE: + DMEMIT("%s %s", lc->dev->name, lc->logdev->name); + break; + } +} + +static int log_writes_ioctl(struct dm_target *ti, unsigned int cmd, + unsigned long arg) +{ + struct log_writes_c *lc = ti->private; + struct dm_dev *dev = lc->dev; + int r = 0; + + /* + * Only pass ioctls through if the device sizes match exactly. + */ + if (ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT) + r = scsi_verify_blk_ioctl(NULL, cmd); + + return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg); +} + +static int log_writes_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct log_writes_c *lc = ti->private; + struct request_queue *q = bdev_get_queue(lc->dev->bdev); + + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = lc->dev->bdev; + bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector); + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + +static int log_writes_iterate_devices(struct dm_target *ti, + iterate_devices_callout_fn fn, + void *data) +{ + struct log_writes_c *lc = ti->private; + + return fn(ti, lc->dev, 0, ti->len, data); +} + +/* + * Messages supported: + * mark <mark data> - specify the marked data. 
+ */ +static int log_writes_message(struct dm_target *ti, unsigned argc, char **argv) +{ + int r = -EINVAL; + struct log_writes_c *lc = ti->private; + + if (argc != 2) { + DMWARN("Invalid log-writes message arguments, expect 2 arguments, got %d", argc); + return r; + } + + if (!strcasecmp(argv[0], "mark")) + r = log_mark(lc, argv[1]); + else + DMWARN("Unrecognised log writes target message received: %s", argv[0]); + + return r; +} + +static void log_writes_io_hints(struct dm_target *ti, struct queue_limits *limits) +{ + struct log_writes_c *lc = ti->private; + struct request_queue *q = bdev_get_queue(lc->dev->bdev); + + if (!q || !blk_queue_discard(q)) { + lc->device_supports_discard = false; + limits->discard_granularity = 1 << SECTOR_SHIFT; + limits->max_discard_sectors = (UINT_MAX >> SECTOR_SHIFT); + } +} + +static struct target_type log_writes_target = { + .name = "log-writes", + .version = {1, 0, 0}, + .module = THIS_MODULE, + .ctr = log_writes_ctr, + .dtr = log_writes_dtr, + .map = log_writes_map, + .end_io = normal_end_io, + .status = log_writes_status, + .ioctl = log_writes_ioctl, + .merge = log_writes_merge, + .message = log_writes_message, + .iterate_devices = log_writes_iterate_devices, + .io_hints = log_writes_io_hints, +}; + +static int __init dm_log_writes_init(void) +{ + int r = dm_register_target(&log_writes_target); + + if (r < 0) + DMERR("register failed %d", r); + + return r; +} + +static void __exit dm_log_writes_exit(void) +{ + dm_unregister_target(&log_writes_target); +} + +/* Module hooks */ +module_init(dm_log_writes_init); +module_exit(dm_log_writes_exit); + +MODULE_DESCRIPTION(DM_NAME " log writes target"); +MODULE_AUTHOR("Josef Bacik <jbacik@fb.com>"); +MODULE_LICENSE("GPL"); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
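As a reading aid for the on-disk format described in the patch above (super at sector 0, then one entry sector followed by the logged data), here is a minimal userspace sketch of walking such a log held in memory. It is not the replay-log tool's code. The helper names (entry_info, put_le64, get_le64, read_super, read_entry, find_mark) are hypothetical, the 512-byte sector size is assumed from lc->sectorsize, and the rule that a discard entry carries no data sectors is inferred from how the kthread advances next_sector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Constants copied from the patch above. */
#define SECTOR_SIZE       512u	/* lc->sectorsize is 1 << SECTOR_SHIFT */
#define WRITE_LOG_MAGIC   0x6a736677736872ULL
#define LOG_DISCARD_FLAG  (1u << 2)
#define LOG_MARK_FLAG     (1u << 3)
#define ENTRY_SIZE        32u	/* four __le64 fields in log_write_entry */

/* Host-endian copy of one log_write_entry (hypothetical helper type). */
struct entry_info {
	uint64_t sector;
	uint64_t nr_sectors;
	uint64_t flags;
	uint64_t data_len;
};

/* Store a little-endian 64-bit value; handy for building test images. */
static void put_le64(unsigned char *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (unsigned char)(v >> (8 * i));
}

/* Load a little-endian 64-bit value independent of host byte order. */
static uint64_t get_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Return nr_entries from the super at sector 0, or -1 on a bad magic. */
static int64_t read_super(const unsigned char *log, size_t len)
{
	if (len < SECTOR_SIZE || get_le64(log) != WRITE_LOG_MAGIC)
		return -1;
	return (int64_t)get_le64(log + 16);
}

/*
 * Decode the entry at sector `pos` and return the sector of the next entry,
 * or 0 if the entry does not fit inside the image.
 */
static uint64_t read_entry(const unsigned char *log, size_t len, uint64_t pos,
			   struct entry_info *e)
{
	uint64_t nsec = len / SECTOR_SIZE;
	const unsigned char *p;
	uint64_t payload;

	assert(e != NULL);
	if (pos == 0 || pos >= nsec)
		return 0;
	p = log + pos * SECTOR_SIZE;
	e->sector = get_le64(p);
	e->nr_sectors = get_le64(p + 8);
	e->flags = get_le64(p + 16);
	e->data_len = get_le64(p + 24);

	/* The kthread advances by one sector only for discards (no data). */
	payload = (e->flags & LOG_DISCARD_FLAG) ? 0 : e->nr_sectors;
	if (payload > nsec - pos - 1)
		return 0;
	return pos + 1 + payload;
}

/* Return the sector holding the mark named `name`, or 0 if it is absent. */
static uint64_t find_mark(const unsigned char *log, size_t len,
			  const char *name)
{
	int64_t nr = read_super(log, len);
	size_t name_len = strlen(name);
	struct entry_info e;
	uint64_t pos = 1;	/* entries start right after the super */

	for (int64_t i = 0; i < nr; i++) {
		uint64_t next = read_entry(log, len, pos, &e);

		if (!next)
			return 0;
		/* Mark text sits in the entry sector, right after the header. */
		if ((e.flags & LOG_MARK_FLAG) && e.data_len == name_len &&
		    name_len <= SECTOR_SIZE - ENTRY_SIZE &&
		    !memcmp(log + pos * SECTOR_SIZE + ENTRY_SIZE, name, name_len))
			return pos;
		pos = next;
	}
	return 0;
}
```

Fields are read at fixed byte offsets rather than through a packed struct so the sketch does not depend on compiler padding or host endianness. Sector 0 is reserved for the super, so a return value of 0 can safely mean "not found".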
* Re: [PATCH 1/3] dm: log writes target V2
  2015-03-20 14:50 ` Josef Bacik
@ 2015-03-20 16:31 ` Zach Brown
From: Zach Brown @ 2015-03-20 16:31 UTC
To: Josef Bacik; +Cc: linux-btrfs, linux-fsdevel, dm-devel, fstests, zab

On Fri, Mar 20, 2015 at 10:50:37AM -0400, Josef Bacik wrote:
> This creates a new target that is meant for file system developers to test file
> system integrity at particular points in the life of a file system. We capture
> all write requests and the data and log the requests and the data to a separate
> device for later replay. There is a userspace utility to do this replay. The
> idea behind this is to give file system developers to verify that the file
> system is always consistent. Thanks,
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
> V1->V2: fixed up stuff based on Zachs review.

Cool,

Reviewed-by: Zach Brown <zab@zabbo.net>

- z
* Re: [PATCH 1/3] dm: log writes target V2
  2015-03-20 14:50 ` Josef Bacik
@ 2015-03-24 15:33 ` Mike Snitzer
  2015-04-07 14:41 ` Josef Bacik
From: Mike Snitzer @ 2015-03-24 15:33 UTC
To: Josef Bacik; +Cc: linux-btrfs, linux-fsdevel, dm-devel, fstests, zab

On Fri, Mar 20 2015 at 10:50am -0400,
Josef Bacik <jbacik@fb.com> wrote:

> This creates a new target that is meant for file system developers to test file
> system integrity at particular points in the life of a file system. We capture
> all write requests and the data and log the requests and the data to a separate
> device for later replay. There is a userspace utility to do this replay. The
> idea behind this is to give file system developers to verify that the file
> system is always consistent. Thanks,
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Hey Josef,

Nice job with this target, you need to contribute to DM more ;)

I've staged your target for the 4.1 merge in linux-dm.git's 'for-next'
branch (with a few small fixes for nits/typos).

FYI, I'll likely rebase 'for-next' once more for 4.1 given a conflicting
4.0 change to DM core still needs to land upstream (you can see the
merge conflict in 'for-next' at the moment). But that won't impact your
target at all (other than changing the 4.1 commit id).

Thanks,
Mike
* Re: [PATCH 1/3] dm: log writes target V2
  2015-03-24 15:33 ` Mike Snitzer
@ 2015-04-07 14:41 ` Josef Bacik
From: Josef Bacik @ 2015-04-07 14:41 UTC
To: Mike Snitzer; +Cc: linux-btrfs, linux-fsdevel, dm-devel, fstests, zab

On 03/24/2015 11:33 AM, Mike Snitzer wrote:
> On Fri, Mar 20 2015 at 10:50am -0400,
> Josef Bacik <jbacik@fb.com> wrote:
>
>> This creates a new target that is meant for file system developers to test file
>> system integrity at particular points in the life of a file system. We capture
>> all write requests and the data and log the requests and the data to a separate
>> device for later replay. There is a userspace utility to do this replay. The
>> idea behind this is to give file system developers to verify that the file
>> system is always consistent. Thanks,
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>
> Hey Josef,
>
> Nice job with this target, you need to contribute to DM more ;)
>
> I've staged your target for the 4.1 merge in linux-dm.git's 'for-next'
> branch (with a few small fixes for nits/typos).
>
> FYI, I'll likely rebase 'for-next' once more for 4.1 given a conflicting
> 4.0 change to DM core still needs to land upstream (you can see the
> merge conflict in 'for-next' at the moment). But that won't impact your
> target at all (other than changing the 4.1 commit id).
>

Great thanks Mike!

Josef
* Re: [PATCH 1/3] dm: log writes target
  2015-03-19 20:31 ` Josef Bacik
@ 2015-03-21 21:50 ` Dave Chinner
  2015-04-07 14:43 ` Josef Bacik
From: Dave Chinner @ 2015-03-21 21:50 UTC
To: Josef Bacik; +Cc: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

On Thu, Mar 19, 2015 at 04:31:08PM -0400, Josef Bacik wrote:
> This creates a new target that is meant for file system developers to test file
> system integrity at particular points in the life of a file system. We capture
> all write requests and the data and log the requests and the data to a separate
> device for later replay. There is a userspace utility to do this replay. The
> idea behind this is to give file system developers to verify that the file
> system is always consistent. Thanks,
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  Documentation/device-mapper/dm-log-writes.txt | 136 +++++
>  drivers/md/Kconfig                            |  16 +
>  drivers/md/Makefile                           |   1 +
>  drivers/md/dm-log-writes.c                    | 809 ++++++++++++++++++++++++++
>  4 files changed, 962 insertions(+)
>  create mode 100644 Documentation/device-mapper/dm-log-writes.txt
>  create mode 100644 drivers/md/dm-log-writes.c
>
> diff --git a/Documentation/device-mapper/dm-log-writes.txt b/Documentation/device-mapper/dm-log-writes.txt
> new file mode 100644
> index 0000000..f3a9fa2
> --- /dev/null
> +++ b/Documentation/device-mapper/dm-log-writes.txt
> @@ -0,0 +1,136 @@
> +dm-log-writes
> +=============
> +
> +This target takes 2 devices, one to pass all IO to normally, and one to log all
> +of the write operations to. This is intended for file system developers wishing
> +to verify the integrity of metadata or data as the file system is written to.
> +There is a log_writes_entry written for every WRITE request and the target is
> +able to take arbitrary data from userspace to insert into the log. The data
> +that is in the WRITE requests is copied into the log to make the replay happen
> +exactly as it happened originally.

Hmm - terminology thing here - "log writes" have specific meaning to
any application that does write ahead logging. E.g. journalling
filesystems, databases, etc. So I find this name extremely confusing
because a dm-log-write device has nothing to do with write ahead
logging, log writes or journalling...

I'm sure lots of other people are going to have the same problem
understanding what this device is for because of that.

I know this is effectively bikeshedding, but I think a less
ambiguous name would be a good thing to have. e.g. dm-iotracer.
Nobody will get confused that way....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/3] dm: log writes target
  2015-03-21 21:50 ` [PATCH 1/3] dm: log writes target Dave Chinner
@ 2015-04-07 14:43 ` Josef Bacik
From: Josef Bacik @ 2015-04-07 14:43 UTC
To: Dave Chinner; +Cc: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

On 03/21/2015 05:50 PM, Dave Chinner wrote:
> On Thu, Mar 19, 2015 at 04:31:08PM -0400, Josef Bacik wrote:
>> This creates a new target that is meant for file system developers to test file
>> system integrity at particular points in the life of a file system. We capture
>> all write requests and the data and log the requests and the data to a separate
>> device for later replay. There is a userspace utility to do this replay. The
>> idea behind this is to give file system developers to verify that the file
>> system is always consistent. Thanks,
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>> ---
>>  Documentation/device-mapper/dm-log-writes.txt | 136 +++++
>>  drivers/md/Kconfig                            |  16 +
>>  drivers/md/Makefile                           |   1 +
>>  drivers/md/dm-log-writes.c                    | 809 ++++++++++++++++++++++++++
>>  4 files changed, 962 insertions(+)
>>  create mode 100644 Documentation/device-mapper/dm-log-writes.txt
>>  create mode 100644 drivers/md/dm-log-writes.c
>>
>> diff --git a/Documentation/device-mapper/dm-log-writes.txt b/Documentation/device-mapper/dm-log-writes.txt
>> new file mode 100644
>> index 0000000..f3a9fa2
>> --- /dev/null
>> +++ b/Documentation/device-mapper/dm-log-writes.txt
>> @@ -0,0 +1,136 @@
>> +dm-log-writes
>> +=============
>> +
>> +This target takes 2 devices, one to pass all IO to normally, and one to log all
>> +of the write operations to. This is intended for file system developers wishing
>> +to verify the integrity of metadata or data as the file system is written to.
>> +There is a log_writes_entry written for every WRITE request and the target is
>> +able to take arbitrary data from userspace to insert into the log. The data
>> +that is in the WRITE requests is copied into the log to make the replay happen
>> +exactly as it happened originally.
>
> Hmm - terminology thing here - "log writes" have specific meaning to
> any application that does write ahead logging. E.g. journalling
> filesystems, databases, etc. So I find this name extremely confusing
> because a dm-log-write device has nothing to do with write ahead
> logging, log writes or journalling...
>
> I'm sure lots of other people are going to have the same problem
> understanding what this device is for because of that.
>
> I know this is effectively bikeshedding, but I think a less
> ambiguous name would be a good thing to have. e.g. dm-iotracer.
> Nobody will get confused that way....
>

Hey Dave,

Sorry I sent this patch and then promptly went on vacation. Mike has
already pulled the patch in and I've already gotten all this tooling
built around the original name. It is probably a little confusing, but
since only a few of us are going to mess with it and it's mostly going
to be used in an xfstests context I'm not too worried about it. Thanks,

Josef
* Re: [dm-devel] [PATCH 1/3] dm: log writes target
  2015-03-19 20:31 ` Josef Bacik
  ` (2 preceding siblings ...)
  (?)
@ 2015-03-23 18:02 ` Vivek Goyal
  2015-04-07 14:45 ` Josef Bacik
  -1 siblings, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2015-03-23 18:02 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

On Thu, Mar 19, 2015 at 04:31:08PM -0400, Josef Bacik wrote:

[..]
> + * We log writes only after they have been flushed, this makes the log describe
> + * close to the order in which the data hits the actual disk, not its cache. So
> + * for example the following sequence (W means write, C means complete)
> + *
> + * Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd
> + *
> + * Would result in the log looking like this
> + *
> + * c,a,flush,fuad,b,<other writes>,<next flush>
> + *

A minor nit: should this sequence be the following?

c,a,b, flush,fuad,<other writes>,<next flush>

By the time the flush has completed, the write of b has completed too. So shouldn't it be written first?

Thanks
Vivek

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [dm-devel] [PATCH 1/3] dm: log writes target
  2015-03-23 18:02 ` [dm-devel] " Vivek Goyal
@ 2015-04-07 14:45 ` Josef Bacik
  0 siblings, 0 replies; 22+ messages in thread
From: Josef Bacik @ 2015-04-07 14:45 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

On 03/23/2015 02:02 PM, Vivek Goyal wrote:
> On Thu, Mar 19, 2015 at 04:31:08PM -0400, Josef Bacik wrote:
>
> [..]
>> + * We log writes only after they have been flushed, this makes the log describe
>> + * close to the order in which the data hits the actual disk, not its cache. So
>> + * for example the following sequence (W means write, C means complete)
>> + *
>> + * Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd
>> + *
>> + * Would result in the log looking like this
>> + *
>> + * c,a,flush,fuad,b,<other writes>,<next flush>
>> + *
>
> A minor nit, Should this sequence be following.
>
> c,a,b, flush,fuad,<other writes>,<next flush>
>
> when flush completed by that time write of b has completed too. So it
> should be written first?
>

We want to catch file systems behaving badly by not waiting for the IO they care about to complete before issuing their flush, so we take the super pessimistic view that only IO that has completed by FLUSH issue time can truly be safe. For all we know the flush could have happened first and we just happened to get the endio called for b before the one for the flush. So to make it most likely that we catch fs bugs, we enforce this idea that only IO completed at flush submit time can be sure to have been flushed. Thanks,

Josef

^ permalink raw reply [flat|nested] 22+ messages in thread
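The pessimistic rule described in this reply can be modelled in a few lines of shell. This is a simplified sketch and not the kernel code: it assumes a single flush in flight at a time, the event names are the ones from the quoted comment, and the function name is made up for the illustration. A write only counts as covered by a flush if its completion was seen before the flush was issued; anything completing later waits for the next flush.

```bash
#!/bin/bash
# Simplified model of the ordering rule (not the kernel implementation).
# Events: W<x> write issued, C<x> write completed, FLUSH / CFLUSH flush
# issued / completed, FUA<x> / CFUA<x> FUA write issued / completed.
# Assumption: at most one flush is in flight at any time.
model_log_order()
{
	local unflushed=() batch=() log=() ev

	for ev in "$@"; do
		case $ev in
		CFLUSH)
			# Flush completed: log the writes it covered, then the flush.
			log+=("${batch[@]}" flush)
			batch=()
			;;
		FLUSH)
			# Only writes already completed at issue time are covered.
			batch=("${unflushed[@]}")
			unflushed=()
			;;
		CFUA*)
			# A FUA write is logged as soon as it completes.
			log+=("fua${ev#CFUA}")
			;;
		FUA*|W*)
			# Issue only; nothing is known to be durable yet.
			;;
		C*)
			# Completed, but not yet covered by any flush.
			unflushed+=("${ev#C}")
			;;
		esac
	done

	# Writes still unflushed would go out with the next flush.
	log+=("${unflushed[@]}")
	echo "${log[*]}"
}

# The sequence from the comment; prints: c a flush fuad b
model_log_order Wa Wb Wc Cc Ca FLUSH FUAd Cb CFLUSH CFUAd
```

In this model b lands after the flush because its completion arrives after FLUSH was issued, which is the behaviour the reply argues for.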
* [PATCH 2/3] fstests: add dm-log-writes test and supporting code
  2015-03-19 20:31 ` Josef Bacik
@ 2015-03-19 20:31 ` Josef Bacik
  -1 siblings, 0 replies; 22+ messages in thread
From: Josef Bacik @ 2015-03-19 20:31 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

This patch adds the supporting code for using the dm-log-writes target. The bash stuff is similar to the dmflakey code, it just gives us functions to build and tear down a dm-log-writes target. We add a new LOGWRITES_DEV variable to take in the device we will use as the log and add checks for that.

I've rigged up fsx to have an integrity check mode. Basically it works like it normally works, but when it fsync()'s it marks the log with a unique mark and dumps its buffer to a file with the mark in the filename. I did this with a system() call simply because it was the fastest. I can link the device-mapper libraries and do it programmatically if that would be preferred, but this works pretty well.

The test itself just runs 300 ops in each of four fsx instances and exits, then finds all of the good buffers in the directory we provided and replays up to the mark given, mounts the file system and compares the md5sum, unmounts and fsck's to check for metadata integrity. dm-log-writes will pretend to do discard and the replay tool will replay it properly depending on the underlying device, either by writing 0's or actually calling the discard ioctl, so I've enabled discard in the test for maximum fun.

This test relies on the supporting userspace code I've written for dm-log-writes.
It can be found here https://github.com/josefbacik/log-writes.git Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> --- README | 2 + common/config | 1 + common/dmlogwrites | 80 ++++++++++++++++++++++++++++++ common/rc | 46 ++++++++++++++++++ ltp/fsx.c | 131 ++++++++++++++++++++++++++++++++++++++++++-------- tests/generic/326 | 130 +++++++++++++++++++++++++++++++++++++++++++++++++ tests/generic/326.out | 2 + tests/generic/group | 1 + 8 files changed, 374 insertions(+), 19 deletions(-) create mode 100644 common/dmlogwrites create mode 100644 tests/generic/326 create mode 100644 tests/generic/326.out diff --git a/README b/README index 0c9449a..112478e 100644 --- a/README +++ b/README @@ -78,6 +78,8 @@ Preparing system for tests (IRIX and Linux): added to the end of fsstresss and fsx invocations, respectively, in case you wish to exclude certain operational modes from these tests. + - setenv LOGWRITES_DEV to a block device to use for power fail + testing. - or add a case to the switch in common/config assigning these variables based on the hostname of your test diff --git a/common/config b/common/config index e5c3579..563e48e 100644 --- a/common/config +++ b/common/config @@ -190,6 +190,7 @@ export DMSETUP_PROG="`set_prog_path dmsetup`" export WIPEFS_PROG="`set_prog_path wipefs`" export DUMP_PROG="`set_prog_path dump`" export RESTORE_PROG="`set_prog_path restore`" +export REPLAYLOG_PROG="`set_prog_path replay-log`" # Generate a comparable xfsprogs version number in the form of # major * 10000 + minor * 100 + release diff --git a/common/dmlogwrites b/common/dmlogwrites new file mode 100644 index 0000000..4df9ea7 --- /dev/null +++ b/common/dmlogwrites @@ -0,0 +1,80 @@ +##/bin/bash +# +# Copyright (c) 2015 Facebook, Inc. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. 
+# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +# +# common functions for setting up and tearing down a dm log-writes device + +_init_log_writes() +{ + local BLK_DEV_SIZE=`blockdev --getsz $SCRATCH_DEV` + LOGWRITES_NAME=logwrites-test + LOGWRITES_DMDEV=/dev/mapper/$LOGWRITES_NAME + LOGWRITES_TABLE="0 $BLK_DEV_SIZE log-writes $SCRATCH_DEV $LOGWRITES_DEV" + $DMSETUP_PROG create $LOGWRITES_NAME --table "$LOGWRITES_TABLE" || \ + _fatal "failed to create log-writes device" + $DMSETUP_PROG mknodes > /dev/null 2>&1 +} + +_log_writes_mark() +{ + [ $# -ne 1 ] && _fatal "_log_writes_mark takes one argument" + $DMSETUP_PROG message $LOGWRITES_NAME 0 mark $1 +} + +_log_writes_mkfs() +{ + _scratch_options mkfs + _mkfs_dev $SCRATCH_OPTIONS $LOGWRITES_DMDEV + _log_writes_mark mkfs +} + +_mount_log_writes() +{ + mount -t $FSTYP $MOUNT_OPTIONS $* $LOGWRITES_DMDEV $SCRATCH_MNT +} + +_unmount_log_writes() +{ + $UMOUNT_PROG $SCRATCH_MNT +} + +# _replay_log <mark> +# +# This replays the log contained on $INTEGRITY_DEV onto $SCRATCH_DEV upto the +# mark passed in. +_replay_log() +{ + _mark=$1 + + $REPLAYLOG_PROG --log $LOGWRITES_DEV --replay $SCRATCH_DEV \ + --end-mark $_mark > /dev/null 2>&1 + [ $? 
-ne 0 ] && _fatal "replay failed" +} + +_log_writes_remove() +{ + $DMSETUP_PROG remove $LOGWRITES_NAME > /dev/null 2>&1 + $DMSETUP_PROG mknodes > /dev/null 2>&1 +} + +_cleanup_log_writes() +{ + # If dmsetup load fails then we need to make sure to do resume here + # otherwise the umount will hang + $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1 + _log_writes_remove +} diff --git a/common/rc b/common/rc index 857308a..c6c2059 100644 --- a/common/rc +++ b/common/rc @@ -1311,6 +1311,24 @@ _require_dm_flakey() fi } +# this test requires the device mapper log-writes target +# +_require_dm_log_writes() +{ + [ -z $LOGWRITES_DEV ] && _notrun "This test requires a logwrites dev" + _require_block_device $SCRATCH_DEV + _require_block_device $LOGWRITES_DEV + _require_command $DMSETUP_PROG + _require_command $REPLAYLOG_PROG + + modprobe dm-log-writes >/dev/null 2>&1 + $DMSETUP_PROG targets | grep "log-writes" > /dev/null 2>&1 + if [ $? -ne 0 ] + then + _notrun "This test requires dm log-writes support" + fi +} + # this test requires the projid32bit feature to be available in mkfs.xfs. 
# _require_projid32bit() @@ -1545,6 +1563,34 @@ _require_xfs_io_command() _notrun "xfs_io $command failed (old kernel/wrong fs?)" } +_test_falloc_support() +{ + if [ $# -ne 1 ] + then + echo "Usage: _test_falloc_support command" 1>&2 + exit 1 + fi + command=$1 + + testfile=$TEST_DIR/$$.xfs_io + case $command in + "fpunch" | "fcollapse" | "zero" | "fzero" | "finsert" ) + testio=`$XFS_IO_PROG -F -f -c "pwrite 0 20k" -c "fsync" \ + -c "$command 4k 8k" $testfile 2>&1` + ;; + *) + echo "Not a valid falloc command" 1>&2 + exit 1 + esac + + rm -f $testfile 2>&1 > /dev/null + echo $testio | grep -q "not found" && \ + return 0 + echo $testio | grep -q "Operation not supported" && \ + return 0 + return 1 +} + # check that kernel and filesystem support direct I/O _require_odirect() { diff --git a/ltp/fsx.c b/ltp/fsx.c index 6da51e9..47ac865 100644 --- a/ltp/fsx.c +++ b/ltp/fsx.c @@ -61,15 +61,17 @@ int logcount = 0; /* total ops */ * be careful in how we select the different operations. The active operations * are mapped to numbers as follows: * - * lite !lite - * READ: 0 0 - * WRITE: 1 1 - * MAPREAD: 2 2 - * MAPWRITE: 3 3 - * TRUNCATE: - 4 - * FALLOCATE: - 5 - * PUNCH HOLE: - 6 - * ZERO RANGE: - 7 + * lite !lite integrity + * READ: 0 0 0 + * WRITE: 1 1 1 + * MAPREAD: 2 2 2 + * MAPWRITE: 3 3 3 + * TRUNCATE: - 4 4 + * FALLOCATE: - 5 5 + * PUNCH HOLE: - 6 6 + * ZERO RANGE: - 7 7 + * COLLAPSE RANGE: - 8 8 + * FSYNC: - - 9 * * When mapped read/writes are disabled, they are simply converted to normal * reads and writes. 
When fallocate/fpunch calls are disabled, they are @@ -98,6 +100,10 @@ int logcount = 0; /* total ops */ #define OP_INSERT_RANGE 9 #define OP_MAX_FULL 10 +/* integrity operations */ +#define OP_FSYNC 10 +#define OP_MAX_INTEGRITY 11 + /* operation modifiers */ #define OP_CLOSEOPEN 100 #define OP_SKIPPED 101 @@ -111,6 +117,9 @@ char *original_buf; /* a pointer to the original data */ char *good_buf; /* a pointer to the correct data */ char *temp_buf; /* a pointer to the current data */ char *fname; /* name of our test file */ +char *bname; /* basename of our test file */ +char *logdev; /* -I flag */ +char dname[1024]; /* -P flag */ int fd; /* fd for our test file */ blksize_t block_size = 0; @@ -149,9 +158,11 @@ int zero_range_calls = 1; /* -z flag disables */ int collapse_range_calls = 1; /* -C flag disables */ int insert_range_calls = 1; /* -I flag disables */ int mapped_reads = 1; /* -R flag disables it */ +int integrity = 0; /* -I flag */ int fsxgoodfd = 0; int o_direct; /* -Z */ int aio = 0; +int mark_nr = 0; int page_size; int page_mask; @@ -350,6 +361,9 @@ logdump(void) lp->args[0] + lp->args[1]) prt("\t******IIII"); break; + case OP_FSYNC: + prt("FSYNC"); + break; case OP_SKIPPED: prt("SKIPPED (no operation)"); break; @@ -429,6 +443,42 @@ report_failure(int status) *(((unsigned char *)(cp)) + 1))) void +mark_log(void) +{ + char command[256]; + int ret; + + snprintf(command, 256, "dmsetup message %s 0 mark %s.mark%d", logdev, + bname, mark_nr); + ret = system(command); + if (ret) { + prterr("dmsetup mark failed"); + exit(1); + } +} + +void +dump_fsync_buffer(void) +{ + char fname_buffer[1024]; + int good_fd; + + if (!good_buf) + return; + + snprintf(fname_buffer, 1024, "%s%s.mark%d", dname, + bname, mark_nr); + good_fd = open(fname_buffer, O_WRONLY|O_CREAT|O_TRUNC, 0666); + if (good_fd < 0) { + prterr(fname_buffer); + exit(1); + } + + save_buffer(good_buf, file_size, good_fd); + close(good_fd); +} + +void check_buffers(unsigned offset, unsigned size) { 
unsigned char c, t; @@ -1183,6 +1233,26 @@ docloseopen(void) } } +void +dofsync(void) +{ + int ret; + + if (testcalls <= simulatedopcount) + return; + if (debug) + prt("%lu fsync\n", testcalls); + log4(OP_FSYNC, 0, 0, 0); + ret = fsync(fd); + if (ret < 0) { + prterr("dofsync"); + report_failure(190); + } + mark_log(); + dump_fsync_buffer(); + printf("Dumped fsync buffer mark %d\n", mark_nr); + mark_nr++; +} #define TRIM_OFF(off, size) \ do { \ @@ -1233,8 +1303,10 @@ test(void) /* calculate appropriate op to run */ if (lite) op = rv % OP_MAX_LITE; - else + else if (!integrity) op = rv % OP_MAX_FULL; + else + op = rv % OP_MAX_INTEGRITY; switch (op) { case OP_MAPREAD: @@ -1343,6 +1415,9 @@ test(void) do_insert_range(offset, size); break; + case OP_FSYNC: + dofsync(); + break; default: prterr("test: unknown operation"); report_failure(42); @@ -1372,7 +1447,7 @@ void usage(void) { fprintf(stdout, "usage: %s", - "fsx [-dnqxAFLOWZ] [-b opnum] [-c Prob] [-l flen] [-m start:end] [-o oplen] [-p progressinterval] [-r readbdy] [-s style] [-t truncbdy] [-w writebdy] [-D startingop] [-N numops] [-P dirpath] [-S seed] fname\n\ + "fsx [-dnqxAFLOWZ] [-b opnum] [-c Prob] [-l flen] [-m start:end] [-o oplen] [-p progressinterval] [-r readbdy] [-s style] [-t truncbdy] [-w writebdy] [-D startingop] [-N numops] [-P dirpath] [-S seed] [-I logdev] fname\n\ -b opnum: beginning operation number (default 1)\n\ -c P: 1 in P chance of file close+open at each op (default infinity)\n\ -d: debug output for all operations\n\ @@ -1417,6 +1492,7 @@ usage(void) -W: mapped write operations DISabled\n\ -R: read() system calls only (mapped reads disabled)\n\ -Z: O_DIRECT (use -R, -W, -r and -w too)\n\ + -i logdev: do integrity testing, logdev is the dm log writes device\n\ fname: this filename is REQUIRED (no default)\n"); exit(90); } @@ -1580,13 +1656,14 @@ int main(int argc, char **argv) { int i, style, ch; - char *endp; + char *endp, *tmp; char goodfile[1024]; char logfile[1024]; struct stat statbuf; 
goodfile[0] = 0; logfile[0] = 0; + dname[0] = 0; page_size = getpagesize(); page_mask = page_size - 1; @@ -1595,7 +1672,7 @@ main(int argc, char **argv) setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */ - while ((ch = getopt(argc, argv, "b:c:dfl:m:no:p:qr:s:t:w:xyAD:FKHzCILN:OP:RS:WZ")) + while ((ch = getopt(argc, argv, "b:c:dfl:m:no:p:qr:s:t:w:xyAD:FKHzCILN:OP:RS:WZi:")) != EOF) switch (ch) { case 'b': @@ -1719,10 +1796,11 @@ main(int argc, char **argv) randomoplen = 0; break; case 'P': - strncpy(goodfile, optarg, sizeof(goodfile)); - strcat(goodfile, "/"); - strncpy(logfile, optarg, sizeof(logfile)); - strcat(logfile, "/"); + strncpy(dname, optarg, sizeof(dname)); + strcat(dname, "/"); + + strncpy(goodfile, dname, sizeof(goodfile)); + strncpy(logfile, dname, sizeof(logfile)); break; case 'R': mapped_reads = 0; @@ -1744,6 +1822,14 @@ main(int argc, char **argv) case 'Z': o_direct = O_DIRECT; break; + case 'i': + integrity = 1; + logdev = strdup(optarg); + if (!logdev) { + prterr("malloc"); + exit(1); + } + break; default: usage(); /* NOTREACHED */ @@ -1753,6 +1839,12 @@ main(int argc, char **argv) if (argc != 1) usage(); fname = argv[0]; + tmp = strdup(fname); + if (!tmp) { + prterr("strdup"); + exit(1); + } + bname = basename(tmp); signal(SIGHUP, cleanup); signal(SIGINT, cleanup); @@ -1795,14 +1887,14 @@ main(int argc, char **argv) } } #endif - strncat(goodfile, fname, 256); + strncat(goodfile, bname, 256); strcat (goodfile, ".fsxgood"); fsxgoodfd = open(goodfile, O_RDWR|O_CREAT|O_TRUNC, 0666); if (fsxgoodfd < 0) { prterr(goodfile); exit(92); } - strncat(logfile, fname, 256); + strncat(logfile, bname, 256); strcat (logfile, ".fsxlog"); fsxlogf = fopen(logfile, "w"); if (fsxlogf == NULL) { @@ -1874,6 +1966,7 @@ main(int argc, char **argv) while (numops == -1 || numops--) test(); + free(tmp); if (close(fd)) { prterr("close"); report_failure(99); diff --git a/tests/generic/326 b/tests/generic/326 new file mode 100644 index 0000000..b4346e6 --- 
/dev/null +++ b/tests/generic/326 @@ -0,0 +1,130 @@ +#! /bin/bash +# FS QA Test No. 326 +# +# Run fsx with log writes to verify power fail safeness. +# +#----------------------------------------------------------------------- +# Copyright (c) 2015 Facebook. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#----------------------------------------------------------------------- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +status=1 # failure is the default! + +_cleanup() +{ + _cleanup_log_writes +} +trap "_cleanup; exit \$status" 0 1 2 3 15 + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. 
./common/dmlogwrites + +# real QA test starts here +_supported_fs generic +_supported_os Linux +_need_to_be_root +_require_scratch_nocheck +_require_dm_log_writes + +rm -f $seqres.full +rm -rf $TEST_DIR/fsxtests + +_check_files() +{ + _name=$1 + # Now look for our files + for i in $(find $SANITY_DIR -type f | grep $_name | grep mark) + do + filename=$(basename $i) + mark="${filename##*.}" + echo "checking $filename" >> $seqres.full + _replay_log $filename + _scratch_mount + expected_md5=$(md5sum $i | cut -f 1 -d ' ') + md5=$(md5sum $SCRATCH_MNT/$_name | cut -f 1 -d ' ') + [ "${md5}x" != "${expected_md5}x" ] && _fatal "md5sum mismatched" + _scratch_unmount + _check_scratch_fs + done +} + +SANITY_DIR=$TEST_DIR/fsxtests +mkdir $SANITY_DIR + +# Create the log +_init_log_writes + +_log_writes_mkfs >> $seqres.full 2>&1 + +# Log writes emulates discard support, turn it on for maximum crying. +_mount_log_writes -o discard + +FSX_OPTS="" +[ $(_test_falloc_support "fpunch") ] || FSX_OPTS="-H" +[ $(_test_falloc_support "fcollapse") ] || FSX_OPTS="$FSX_OPTS -C" +[ $(_test_falloc_support "fzero") ] || FSX_OPTS="$FSX_OPTS -z" +[ $(_test_falloc_support "finsert") ] || FSX_OPTS="$FSX_OPTS -I" + +# Run fsx for a while +run_check $here/ltp/fsx -P $SANITY_DIR -N 300 -S 0 -i $LOGWRITES_DMDEV \ + $FSX_OPTS $SCRATCH_MNT/testfile1 & +run_check $here/ltp/fsx -P $SANITY_DIR -N 300 -S 0 -i $LOGWRITES_DMDEV \ + $FSX_OPTS $SCRATCH_MNT/testfile2 & +run_check $here/ltp/fsx -P $SANITY_DIR -N 300 -S 0 -i $LOGWRITES_DMDEV \ + $FSX_OPTS $SCRATCH_MNT/testfile3 & +run_check $here/ltp/fsx -P $SANITY_DIR -N 300 -S 0 -i $LOGWRITES_DMDEV \ + $FSX_OPTS $SCRATCH_MNT/testfile4 & +wait +test1_md5=$(md5sum $SCRATCH_MNT/testfile1 | cut -f 1 -d ' ') +test2_md5=$(md5sum $SCRATCH_MNT/testfile2 | cut -f 1 -d ' ') +test3_md5=$(md5sum $SCRATCH_MNT/testfile3 | cut -f 1 -d ' ') +test4_md5=$(md5sum $SCRATCH_MNT/testfile4 | cut -f 1 -d ' ') + +# Unmount the scratch dir and tear down the log writes target 
+_unmount_log_writes +_log_writes_mark end +_log_writes_remove + +for i in testfile1 testfile2 testfile3 testfile4 +do + _check_files $i +done + +# Check the end +_replay_log end +_scratch_mount +md5=$(md5sum $SCRATCH_MNT/testfile1 | cut -f 1 -d ' ') +[ "${md5}x" != "${test1_md5}x" ] && _fatal "testfile1 end md5sum mismatched" +md5=$(md5sum $SCRATCH_MNT/testfile2 | cut -f 1 -d ' ') +[ "${md5}x" != "${test2_md5}x" ] && _fatal "testfile2 end md5sum mismatched" +md5=$(md5sum $SCRATCH_MNT/testfile3 | cut -f 1 -d ' ') +[ "${md5}x" != "${test3_md5}x" ] && _fatal "testfile3 end md5sum mismatched" +md5=$(md5sum $SCRATCH_MNT/testfile4 | cut -f 1 -d ' ') +[ "${md5}x" != "${test4_md5}x" ] && _fatal "testfile4 end md5sum mismatched" +_scratch_unmount +_check_scratch_fs + +echo "Silence is golden" +status=0 +exit + diff --git a/tests/generic/326.out b/tests/generic/326.out new file mode 100644 index 0000000..4ac0db5 --- /dev/null +++ b/tests/generic/326.out @@ -0,0 +1,2 @@ +QA output created by 326 +Silence is golden diff --git a/tests/generic/group b/tests/generic/group index d56d3ce..31e5f7d 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -183,3 +183,4 @@ 323 auto aio stress 324 auto fsr quick 325 auto quick data log +326 auto log -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
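The replay-and-verify loop in this patch walks the buffer dumps that fsx writes at each fsync, named `<basename>.mark<N>`, and replays the log up to the mark of the same name. A minimal sketch of that pairing follows, assuming the dump naming from the fsx changes and the replay-log options used in common/dmlogwrites. Both helper names are hypothetical, the commands are only printed, and unlike the test's find-based loop the marks are visited in numeric order.

```bash
#!/bin/bash
# Sketch only: list the per-fsync buffer dumps for one test file and print
# the replay command that would precede each comparison.
# list_marks and print_replay_plan are hypothetical names for illustration.

# list_marks <dump dir> <file basename>
# Prints the dump names <basename>.mark<N>, sorted by N numerically.
list_marks()
{
	local dir=$1 name=$2 f

	for f in "$dir/$name".mark*; do
		[ -e "$f" ] || continue		# glob matched nothing
		f=${f##*/}
		echo "${f##*.mark} $f"		# "<N> <dump name>"
	done | sort -n | cut -d ' ' -f 2-
}

# print_replay_plan <dump dir> <file basename>
# Dry run: one replay-log invocation per mark, devices left as variables.
print_replay_plan()
{
	local mark

	list_marks "$1" "$2" | while read -r mark; do
		echo "replay-log --log \$LOGWRITES_DEV --replay \$SCRATCH_DEV --end-mark $mark"
	done
}
```

After each replay the test mounts the scratch device, compares the md5sum of the file against the matching dump, then unmounts and runs fsck; those steps need real devices and are left out here.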
goodfile[0] = 0; logfile[0] = 0; + dname[0] = 0; page_size = getpagesize(); page_mask = page_size - 1; @@ -1595,7 +1672,7 @@ main(int argc, char **argv) setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */ - while ((ch = getopt(argc, argv, "b:c:dfl:m:no:p:qr:s:t:w:xyAD:FKHzCILN:OP:RS:WZ")) + while ((ch = getopt(argc, argv, "b:c:dfl:m:no:p:qr:s:t:w:xyAD:FKHzCILN:OP:RS:WZi:")) != EOF) switch (ch) { case 'b': @@ -1719,10 +1796,11 @@ main(int argc, char **argv) randomoplen = 0; break; case 'P': - strncpy(goodfile, optarg, sizeof(goodfile)); - strcat(goodfile, "/"); - strncpy(logfile, optarg, sizeof(logfile)); - strcat(logfile, "/"); + strncpy(dname, optarg, sizeof(dname)); + strcat(dname, "/"); + + strncpy(goodfile, dname, sizeof(goodfile)); + strncpy(logfile, dname, sizeof(logfile)); break; case 'R': mapped_reads = 0; @@ -1744,6 +1822,14 @@ main(int argc, char **argv) case 'Z': o_direct = O_DIRECT; break; + case 'i': + integrity = 1; + logdev = strdup(optarg); + if (!logdev) { + prterr("malloc"); + exit(1); + } + break; default: usage(); /* NOTREACHED */ @@ -1753,6 +1839,12 @@ main(int argc, char **argv) if (argc != 1) usage(); fname = argv[0]; + tmp = strdup(fname); + if (!tmp) { + prterr("strdup"); + exit(1); + } + bname = basename(tmp); signal(SIGHUP, cleanup); signal(SIGINT, cleanup); @@ -1795,14 +1887,14 @@ main(int argc, char **argv) } } #endif - strncat(goodfile, fname, 256); + strncat(goodfile, bname, 256); strcat (goodfile, ".fsxgood"); fsxgoodfd = open(goodfile, O_RDWR|O_CREAT|O_TRUNC, 0666); if (fsxgoodfd < 0) { prterr(goodfile); exit(92); } - strncat(logfile, fname, 256); + strncat(logfile, bname, 256); strcat (logfile, ".fsxlog"); fsxlogf = fopen(logfile, "w"); if (fsxlogf == NULL) { @@ -1874,6 +1966,7 @@ main(int argc, char **argv) while (numops == -1 || numops--) test(); + free(tmp); if (close(fd)) { prterr("close"); report_failure(99); diff --git a/tests/generic/326 b/tests/generic/326 new file mode 100644 index 0000000..b4346e6 --- 
/dev/null +++ b/tests/generic/326 @@ -0,0 +1,130 @@ +#! /bin/bash +# FS QA Test No. 326 +# +# Run fsx with log writes to verify power fail safeness. +# +#----------------------------------------------------------------------- +# Copyright (c) 2015 Facebook. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#----------------------------------------------------------------------- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +status=1 # failure is the default! + +_cleanup() +{ + _cleanup_log_writes +} +trap "_cleanup; exit \$status" 0 1 2 3 15 + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. 
./common/dmlogwrites + +# real QA test starts here +_supported_fs generic +_supported_os Linux +_need_to_be_root +_require_scratch_nocheck +_require_dm_log_writes + +rm -f $seqres.full +rm -rf $TEST_DIR/fsxtests + +_check_files() +{ + _name=$1 + # Now look for our files + for i in $(find $SANITY_DIR -type f | grep $_name | grep mark) + do + filename=$(basename $i) + mark="${filename##*.}" + echo "checking $filename" >> $seqres.full + _replay_log $filename + _scratch_mount + expected_md5=$(md5sum $i | cut -f 1 -d ' ') + md5=$(md5sum $SCRATCH_MNT/$_name | cut -f 1 -d ' ') + [ "${md5}x" != "${expected_md5}x" ] && _fatal "md5sum mismatched" + _scratch_unmount + _check_scratch_fs + done +} + +SANITY_DIR=$TEST_DIR/fsxtests +mkdir $SANITY_DIR + +# Create the log +_init_log_writes + +_log_writes_mkfs >> $seqres.full 2>&1 + +# Log writes emulates discard support, turn it on for maximum crying. +_mount_log_writes -o discard + +FSX_OPTS="" +[ $(_test_falloc_support "fpunch") ] || FSX_OPTS="-H" +[ $(_test_falloc_support "fcollapse") ] || FSX_OPTS="$FSX_OPTS -C" +[ $(_test_falloc_support "fzero") ] || FSX_OPTS="$FSX_OPTS -z" +[ $(_test_falloc_support "finsert") ] || FSX_OPTS="$FSX_OPTS -I" + +# Run fsx for a while +run_check $here/ltp/fsx -P $SANITY_DIR -N 300 -S 0 -i $LOGWRITES_DMDEV \ + $FSX_OPTS $SCRATCH_MNT/testfile1 & +run_check $here/ltp/fsx -P $SANITY_DIR -N 300 -S 0 -i $LOGWRITES_DMDEV \ + $FSX_OPTS $SCRATCH_MNT/testfile2 & +run_check $here/ltp/fsx -P $SANITY_DIR -N 300 -S 0 -i $LOGWRITES_DMDEV \ + $FSX_OPTS $SCRATCH_MNT/testfile3 & +run_check $here/ltp/fsx -P $SANITY_DIR -N 300 -S 0 -i $LOGWRITES_DMDEV \ + $FSX_OPTS $SCRATCH_MNT/testfile4 & +wait +test1_md5=$(md5sum $SCRATCH_MNT/testfile1 | cut -f 1 -d ' ') +test2_md5=$(md5sum $SCRATCH_MNT/testfile2 | cut -f 1 -d ' ') +test3_md5=$(md5sum $SCRATCH_MNT/testfile3 | cut -f 1 -d ' ') +test4_md5=$(md5sum $SCRATCH_MNT/testfile4 | cut -f 1 -d ' ') + +# Unmount the scratch dir and tear down the log writes target 
+_unmount_log_writes +_log_writes_mark end +_log_writes_remove + +for i in testfile1 testfile2 testfile3 testfile4 +do + _check_files $i +done + +# Check the end +_replay_log end +_scratch_mount +md5=$(md5sum $SCRATCH_MNT/testfile1 | cut -f 1 -d ' ') +[ "${md5}x" != "${test1_md5}x" ] && _fatal "testfile1 end md5sum mismatched" +md5=$(md5sum $SCRATCH_MNT/testfile2 | cut -f 1 -d ' ') +[ "${md5}x" != "${test2_md5}x" ] && _fatal "testfile2 end md5sum mismatched" +md5=$(md5sum $SCRATCH_MNT/testfile3 | cut -f 1 -d ' ') +[ "${md5}x" != "${test3_md5}x" ] && _fatal "testfile3 end md5sum mismatched" +md5=$(md5sum $SCRATCH_MNT/testfile4 | cut -f 1 -d ' ') +[ "${md5}x" != "${test4_md5}x" ] && _fatal "testfile4 end md5sum mismatched" +_scratch_unmount +_check_scratch_fs + +echo "Silence is golden" +status=0 +exit + diff --git a/tests/generic/326.out b/tests/generic/326.out new file mode 100644 index 0000000..4ac0db5 --- /dev/null +++ b/tests/generic/326.out @@ -0,0 +1,2 @@ +QA output created by 326 +Silence is golden diff --git a/tests/generic/group b/tests/generic/group index d56d3ce..31e5f7d 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -183,3 +183,4 @@ 323 auto aio stress 324 auto fsr quick 325 auto quick data log +326 auto log -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
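
A note on the FSX_OPTS lines in tests/generic/326: as posted, _test_falloc_support reports its result through the return code and normally prints nothing on stdout, so `[ $(_test_falloc_support "fpunch") ]` appears to reduce to `[ ]`, which is always false. If that reading is right, -H, -C, -z and -I are all added no matter what the probe found. The sketch below shows the difference between testing the substituted output and testing the exit status. probe_quiet is a hypothetical stand-in, not a function from the patch, and which return value the real helper should treat as "supported" is left open here.

```shell
# Hypothetical probe: reports its result only through the exit status
# and prints nothing on stdout, like the helper in the patch appears to.
probe_quiet() {
	return "$1"
}

opts=""
# Command substitution captures stdout, which is empty, so each test is
# "[ ]" and the right-hand side runs regardless of the exit status.
[ $(probe_quiet 0) ] || opts="$opts A"
[ $(probe_quiet 1) ] || opts="$opts B"

opts_status=""
# Testing the exit status directly tells the two cases apart.
probe_quiet 0 || opts_status="$opts_status A"
probe_quiet 1 || opts_status="$opts_status B"

echo "via substitution:$opts"
echo "via exit status:$opts_status"
```

With the substitution form both options are appended; with the exit-status form only the failing probe appends its option.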
* [PATCH 3/3] fstests: btrfs balance with dm log writes test
From: Josef Bacik @ 2015-03-19 20:31 UTC
To: linux-btrfs, linux-fsdevel, dm-devel, zab, fstests

This test runs fsstress+balance+defrag and then replays every FUA in the log and
mounts, scrubs and then fscks the fs to make sure it does the balance recovery
properly. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 tests/btrfs/083     | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/083.out |   1 +
 tests/btrfs/group   |   1 +
 3 files changed, 137 insertions(+)
 create mode 100644 tests/btrfs/083
 create mode 100644 tests/btrfs/083.out

diff --git a/tests/btrfs/083 b/tests/btrfs/083
new file mode 100644
index 0000000..66118b9
--- /dev/null
+++ b/tests/btrfs/083
@@ -0,0 +1,135 @@
+#! /bin/bash
+# FSQA Test No. btrfs/083
+#
+# Run btrfs balance and defrag operations simultaneously with fsstress
+# running in background on top of dm-log-writes.
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2015 Facebook. All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmlogwrites
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+# we check scratch dev after each loop
+_need_to_be_root
+_require_scratch_nocheck
+_require_dm_log_writes
+
+rm -f $seqres.full
+
+_wait_balance()
+{
+	while [ 1 ]
+	do
+		$BTRFS_UTIL_PROG filesystem balance status $SCRATCH_MNT \
+			| grep "No balance" >> $seqres.full
+		[ $? -eq 0 ] && break
+		sleep 1
+	done
+}
+
+run_test()
+{
+	args=`_scale_fsstress_args -p 20 -n 100 $FSSTRESS_AVOID -d $SCRATCH_MNT/stressdir`
+	echo "Run fsstress $args" >>$seqres.full
+	$FSSTRESS_PROG $args >/dev/null 2>&1 &
+	fsstress_pid=$!
+
+	echo -n "Start balance worker: " >>$seqres.full
+	_btrfs_stress_balance $SCRATCH_MNT >/dev/null 2>&1 &
+	balance_pid=$!
+	echo "$balance_pid" >>$seqres.full
+
+	echo -n "Start defrag worker: " >>$seqres.full
+	_btrfs_stress_defrag $SCRATCH_MNT $with_compress >/dev/null 2>&1 &
+	defrag_pid=$!
+	echo "$defrag_pid" >>$seqres.full
+
+	echo "Wait for fsstress to exit and kill all background workers" >>$seqres.full
+	wait $fsstress_pid
+	kill $balance_pid $defrag_pid
+	wait
+	# wait for the balance and defrag operations to finish
+	while ps aux | grep "balance start" | grep -qv grep; do
+		sleep 1
+	done
+	while ps aux | grep "btrfs filesystem defrag" | grep -qv grep; do
+		sleep 1
+	done
+}
+
+_init_log_writes
+
+_log_writes_mkfs >> $seqres.full 2>&1
+
+_mount_log_writes
+
+run_test "$t" nocompress
+
+_unmount_log_writes
+_log_writes_remove
+
+# Get the number of entries in the log
+NUM_ENTRIES=$($REPLAYLOG_PROG --log $LOGWRITES_DEV --num-entries)
+
+# Start at the first FUA after the mkfs
+ENTRY=$($REPLAYLOG_PROG --log $LOGWRITES_DEV --start-mark mkfs \
+	--find --next-fua)
+
+while [ $ENTRY -lt $NUM_ENTRIES ];
+do
+	echo "Replaying to $ENTRY" >> $seqres.full
+	$REPLAYLOG_PROG --log $LOGWRITES_DEV --replay $SCRATCH_DEV --limit \
+		$ENTRY > /dev/null 2>&1
+	[ $? -ne 0 ] && _fatal "replay failed"
+	btrfsck $SCRATCH_DEV >> $seqres.full 2>&1 || _fatal "btrfsck failed"
+	_scratch_mount || _fatal "mount failed"
+	_wait_balance
+	$BTRFS_UTIL_PROG scrub start -B $SCRATCH_MNT >> $seqres.full 2>&1
+	[ $? -ne 0 ] && _fatal "scrub failed"
+	_scratch_unmount
+	btrfsck $SCRATCH_DEV >> $seqres.full 2>&1 || _fatal "btrfsck failed"
+	let ENTRY+=1
+	ENTRY=$($REPLAYLOG_PROG --find --start-entry $ENTRY --log \
+		$LOGWRITES_DEV --next-fua)
+done
+
+status=0
+exit
+
diff --git a/tests/btrfs/083.out b/tests/btrfs/083.out
new file mode 100644
index 0000000..b675a31
--- /dev/null
+++ b/tests/btrfs/083.out
@@ -0,0 +1 @@
+QA output created by 083
diff --git a/tests/btrfs/group b/tests/btrfs/group
index fd2fa76..88719ca 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -85,3 +85,4 @@
 080 auto snapshot
 081 auto quick clone
 082 auto quick remount
+083 auto log
-- 
1.8.3.1
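
The replay loop at the end of btrfs/083 walks the log from one FUA to the next and checks the file system at each stop. A minimal sketch of that control flow follows, with no device or replay-log binary involved. fua_entries, num_entries and next_fua are hypothetical stand-ins, and the convention that the lookup prints num_entries when no FUA is left is an assumption made for the sketch, not documented replay-log behaviour.

```shell
# Hypothetical stand-in for the log: entry numbers that carry the FUA flag.
fua_entries="2 5 7"
num_entries=9

# Hypothetical helper mimicking a "find the next FUA at or after entry $1"
# lookup. It prints num_entries when nothing is left (assumed convention).
next_fua() {
	for e in $fua_entries; do
		if [ "$e" -ge "$1" ]; then
			echo "$e"
			return 0
		fi
	done
	echo "$num_entries"
}

checked=""
entry=$(next_fua 0)
while [ "$entry" -lt "$num_entries" ]; do
	# A real run would replay the log up to $entry here, then fsck,
	# mount, wait for balance, scrub, unmount and fsck again.
	checked="$checked $entry"
	# Step past the entry just checked before searching again.
	entry=$(next_fua $((entry + 1)))
done
echo "checked entries:$checked"
```

Stepping to entry + 1 before the next lookup is what keeps the loop from finding the same FUA again, which is the role `let ENTRY+=1` plays in the test.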
* Re: [PATCH 3/3] fstests: btrfs balance with dm log writes test
From: Filipe David Manana @ 2015-03-25 10:35 UTC
To: Josef Bacik
Cc: linux-btrfs, linux-fsdevel, dm-devel, Zach Brown, fstests

On Thu, Mar 19, 2015 at 8:31 PM, Josef Bacik <jbacik@fb.com> wrote:
> This test runs fsstress+balance+defrag and then replays every FUA in the log and
> mounts, scrubs and then fscks the fs to make sure it does the balance recovery
> properly. Thanks,
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Looks good, only some minor comments below.
Thanks.

> ---
>  tests/btrfs/083     | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/083.out |   1 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 137 insertions(+)
>  create mode 100644 tests/btrfs/083
>  create mode 100644 tests/btrfs/083.out
>
> diff --git a/tests/btrfs/083 b/tests/btrfs/083
> new file mode 100644
> index 0000000..66118b9
> --- /dev/null
> +++ b/tests/btrfs/083
> @@ -0,0 +1,135 @@
> +#! /bin/bash
> +# FSQA Test No. btrfs/083
> +#
> +# Run btrfs balance and defrag operations simultaneously with fsstress
> +# running in background on top of dm-log-writes.
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (C) 2015 Facebook. All rights reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmlogwrites
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +# we check scratch dev after each loop
> +_need_to_be_root
> +_require_scratch_nocheck
> +_require_dm_log_writes
> +
> +rm -f $seqres.full
> +
> +_wait_balance()
> +{
> +	while [ 1 ]
> +	do

Generally the style used in fstests is:

while X; do

> +		$BTRFS_UTIL_PROG filesystem balance status $SCRATCH_MNT \
> +			| grep "No balance" >> $seqres.full
> +		[ $? -eq 0 ] && break
> +		sleep 1
> +	done
> +}
> +
> +run_test()
> +{
> +	args=`_scale_fsstress_args -p 20 -n 100 $FSSTRESS_AVOID -d $SCRATCH_MNT/stressdir`
> +	echo "Run fsstress $args" >>$seqres.full
> +	$FSSTRESS_PROG $args >/dev/null 2>&1 &
> +	fsstress_pid=$!
> +
> +	echo -n "Start balance worker: " >>$seqres.full
> +	_btrfs_stress_balance $SCRATCH_MNT >/dev/null 2>&1 &
> +	balance_pid=$!
> +	echo "$balance_pid" >>$seqres.full
> +
> +	echo -n "Start defrag worker: " >>$seqres.full
> +	_btrfs_stress_defrag $SCRATCH_MNT $with_compress >/dev/null 2>&1 &
> +	defrag_pid=$!
> +	echo "$defrag_pid" >>$seqres.full
> +
> +	echo "Wait for fsstress to exit and kill all background workers" >>$seqres.full
> +	wait $fsstress_pid
> +	kill $balance_pid $defrag_pid
> +	wait
> +	# wait for the balance and defrag operations to finish
> +	while ps aux | grep "balance start" | grep -qv grep; do
> +		sleep 1
> +	done
> +	while ps aux | grep "btrfs filesystem defrag" | grep -qv grep; do
> +		sleep 1
> +	done
> +}
> +
> +_init_log_writes
> +
> +_log_writes_mkfs >> $seqres.full 2>&1
> +
> +_mount_log_writes
> +
> +run_test "$t" nocompress

The arguments passed to run_test don't seem to be used anywhere (nor $t defined).

> +
> +_unmount_log_writes
> +_log_writes_remove
> +
> +# Get the number of entries in the log
> +NUM_ENTRIES=$($REPLAYLOG_PROG --log $LOGWRITES_DEV --num-entries)
> +
> +# Start at the first FUA after the mkfs
> +ENTRY=$($REPLAYLOG_PROG --log $LOGWRITES_DEV --start-mark mkfs \
> +	--find --next-fua)
> +
> +while [ $ENTRY -lt $NUM_ENTRIES ];
> +do

Same as above.

> +	echo "Replaying to $ENTRY" >> $seqres.full
> +	$REPLAYLOG_PROG --log $LOGWRITES_DEV --replay $SCRATCH_DEV --limit \
> +		$ENTRY > /dev/null 2>&1
> +	[ $? -ne 0 ] && _fatal "replay failed"
> +	btrfsck $SCRATCH_DEV >> $seqres.full 2>&1 || _fatal "btrfsck failed"

Any reason to not use _check_scratch_fs instead?

> +	_scratch_mount || _fatal "mount failed"
> +	_wait_balance
> +	$BTRFS_UTIL_PROG scrub start -B $SCRATCH_MNT >> $seqres.full 2>&1
> +	[ $? -ne 0 ] && _fatal "scrub failed"
> +	_scratch_unmount
> +	btrfsck $SCRATCH_DEV >> $seqres.full 2>&1 || _fatal "btrfsck failed"

Same as above.

> +	let ENTRY+=1
> +	ENTRY=$($REPLAYLOG_PROG --find --start-entry $ENTRY --log \
> +		$LOGWRITES_DEV --next-fua)
> +done
> +
> +status=0
> +exit
> +
> diff --git a/tests/btrfs/083.out b/tests/btrfs/083.out
> new file mode 100644
> index 0000000..b675a31
> --- /dev/null
> +++ b/tests/btrfs/083.out
> @@ -0,0 +1 @@
> +QA output created by 083
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index fd2fa76..88719ca 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -85,3 +85,4 @@
>  080 auto snapshot
>  081 auto quick clone
>  082 auto quick remount
> +083 auto log
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
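
The comment about run_test can be checked with a few lines of shell. This is a sketch that assumes, as in the posted test, that neither t nor with_compress is assigned anywhere before the call. The echo is illustrative only and is not part of the test.

```shell
# Mirror the posted test: neither variable is ever assigned.
unset t with_compress

run_test()
{
	# The posted function body never reads $1 or $2. They are printed
	# here only to show what the caller passed. $with_compress, which
	# the posted body does read, expands to nothing when unset.
	echo "argc=$# arg1='$1' arg2='$2' with_compress='$with_compress'"
}

summary=$(run_test "$t" nocompress)
echo "$summary"
```

The call passes an empty first argument and "nocompress" as the second, and neither reaches the defrag worker, which only sees the empty $with_compress.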
end of thread, other threads:[~2015-04-07 14:45 UTC | newest]

Thread overview: 22+ messages
2015-03-19 20:31 [PATCH 0/3] Device mapper log writes patches Josef Bacik
2015-03-19 20:31 ` Josef Bacik
2015-03-19 20:31 ` [PATCH 1/3] dm: log writes target Josef Bacik
2015-03-19 20:31 ` Josef Bacik
2015-03-19 23:16 ` Zach Brown
2015-03-20 14:50 ` [PATCH 1/3] dm: log writes target V2 Josef Bacik
2015-03-20 14:50 ` Josef Bacik
2015-03-20 16:31 ` Zach Brown
2015-03-24 15:33 ` Mike Snitzer
2015-04-07 14:41 ` Josef Bacik
2015-04-07 14:41 ` Josef Bacik
2015-03-21 21:50 ` [PATCH 1/3] dm: log writes target Dave Chinner
2015-04-07 14:43 ` Josef Bacik
2015-04-07 14:43 ` Josef Bacik
2015-03-23 18:02 ` [dm-devel] " Vivek Goyal
2015-04-07 14:45 ` Josef Bacik
2015-04-07 14:45 ` Josef Bacik
2015-03-19 20:31 ` [PATCH 2/3] fstests: add dm-log-writes test and supporting code Josef Bacik
2015-03-19 20:31 ` Josef Bacik
2015-03-19 20:31 ` [PATCH 3/3] fstests: btrfs balance with dm log writes test Josef Bacik
2015-03-19 20:31 ` Josef Bacik
2015-03-25 10:35 ` Filipe David Manana