* [RFC] block integrity: Fix write after checksum calculation problem @ 2011-02-22 2:00 Darrick J. Wong 2011-02-22 5:45 ` Boaz Harrosh ` (2 more replies) 0 siblings, 3 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-02-22 2:00 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Hi all, Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a discussion about how to deal with the situation where one program tells the kernel to write a block to disk, the kernel computes the checksum of that data, and then a second program begins writing to that same block before the disk HBA can DMA the memory block, thereby causing the disk to complain about being sent invalid checksums. By my recollection, several solutions were proposed, such as retrying the I/O with a new checksum; using the VM to track and stop page writes during I/O; creating shadow pages so that subsequent memory writes go to a different page; punting the integrity failure to the app to let it deal with it; and only allowing DIF mode when O_DIRECT is enabled. Unfortunately, I didn't get a sense that any sort of consensus had been reached (much less anything patched into the kernel) and since I was able to write a trivial program to trigger the write problem, I'm pretty sure that this has not been fixed upstream. (FYI, using O_DIRECT still seems fine.) Below is a simple if naive solution: (ab)use the bounce buffering code to copy the memory page just prior to calculating the checksum, and send the copy and the checksum to the disk controller. With this patch applied, the invalid guard tag messages go away. An optimization would be to perform the copy only when memory contents change, but I wanted to ask peoples' opinions before continuing. I don't imagine bounce buffering is particularly speedy, though I haven't noticed any corruption errors or weirdness yet. Anyway, I'm mostly wondering: what do people think of this as a starting point to fixing the DIF checksum problem? -- block integrity: Fix write after integrity checksum calculation problem The kernel does not prohibit writes to a dirty page during a write IO. This leads to the race where a memory write occurs after the integrity data has been calculated but before the hardware initiates the DMA operation; when this happens, the data does not match the checksum and the disk fails the write due to an invalid checksum. An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot a page's contents just prior to calculating the checksum, and sending the copy and its checksum to the hardware. This is probably terrible for performance. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- block/Kconfig | 1 block/blk-core.c | 2 - block/blk-integrity.c | 2 + fs/bio-integrity.c | 12 +++- include/linux/bio.h | 4 + include/linux/blkdev.h | 12 ++++ mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 154 insertions(+), 8 deletions(-) diff --git a/block/Kconfig b/block/Kconfig index 60be1e0..ed89f19 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -66,6 +66,7 @@ config BLK_DEV_BSG If unsure, say Y. config BLK_DEV_INTEGRITY + depends on BOUNCE=y bool "Block layer data integrity support" ---help--- Some storage devices allow extra information to be diff --git a/block/blk-core.c b/block/blk-core.c index 3cc3bed..81a4e50 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio) */ blk_partition_remap(bio); - if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) + if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio)) goto end_io; if (old_sector != -1) diff --git a/block/blk-integrity.c b/block/blk-integrity.c index 54bcba6..24843f8 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template) BUG_ON(disk == NULL); + init_integrity_pool(); + if (disk->integrity == NULL) { bi = kmem_cache_alloc(integrity_cachep, GFP_KERNEL | __GFP_ZERO); diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c index df4fdfd..9eee904 100644 --- a/fs/bio-integrity.c +++ b/fs/bio-integrity.c @@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi) * block device's integrity function. In the READ case, the buffer * will be prepared for DMA and a suitable end_io handler set up. */ -int bio_integrity_prep(struct bio *bio) +int bio_integrity_prep(struct bio **pbio) { + struct bio *bio; struct bio_integrity_payload *bip; struct blk_integrity *bi; struct request_queue *q; @@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio) unsigned int bytes, offset, i; unsigned int sectors; - bi = bdev_get_integrity(bio->bi_bdev); - q = bdev_get_queue(bio->bi_bdev); + bi = bdev_get_integrity((*pbio)->bi_bdev); + q = bdev_get_queue((*pbio)->bi_bdev); BUG_ON(bi == NULL); - BUG_ON(bio_integrity(bio)); + BUG_ON(bio_integrity(*pbio)); + + blk_queue_integrity_bounce(q, pbio); + bio = *pbio; sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio)); diff --git a/include/linux/bio.h b/include/linux/bio.h index 36d10f0..1834934 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl) for_each_bio(_bio) \ bip_for_each_vec(_bvl, _bio->bi_integrity, _iter) -#define bio_integrity(bio) (bio->bi_integrity != NULL) +#define bio_integrity(bio) ((bio)->bi_integrity != NULL) extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *); extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int); @@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns extern int bio_integrity_enabled(struct bio *bio); extern int bio_integrity_set_tag(struct bio *, void *, unsigned int); extern int bio_integrity_get_tag(struct bio *, void *, unsigned int); -extern int bio_integrity_prep(struct bio *); +extern int bio_integrity_prep(struct bio **); extern void bio_integrity_endio(struct bio *, int); extern void bio_integrity_advance(struct bio *, unsigned int); extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 8a082a5..1c9fc40 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn; #define BLK_DEFAULT_SG_TIMEOUT (60 * HZ) #define BLK_MIN_SG_TIMEOUT (7 * HZ) +#ifdef CONFIG_BLK_DEV_INTEGRITY +extern int init_integrity_pool(void); +extern void blk_queue_integrity_bounce(struct request_queue *q, + struct bio **bio); +#else +#define init_integrity_pool() do { } while (0); +static inline void blk_queue_integrity_bounce(struct request_queue *q, + struct bio **bio) +{ +} +#endif + #ifdef CONFIG_BOUNCE extern int init_emergency_isa_pool(void); extern void blk_queue_bounce(struct request_queue *q, struct bio **bio); diff --git a/mm/bounce.c b/mm/bounce.c index 1481de6..c0ce533 100644 --- a/mm/bounce.c +++ b/mm/bounce.c @@ -21,7 +21,19 @@ #define POOL_SIZE 64 #define ISA_POOL_SIZE 16 -static mempool_t *page_pool, *isa_page_pool; +static mempool_t *integrity_pool, *page_pool, *isa_page_pool; + +int init_integrity_pool(void) +{ + if (integrity_pool) + return 0; + + integrity_pool = mempool_create_page_pool(POOL_SIZE, 0); + BUG_ON(!integrity_pool); + printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE); + + return 0; +} #ifdef CONFIG_HIGHMEM static __init int init_emergency_pool(void) @@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err) bounce_end_io(bio, page_pool, err); } +static void bounce_end_integrity_io_write(struct bio *bio, int err) +{ + bounce_end_io(bio, integrity_pool, err); +} + static void bounce_end_io_write_isa(struct bio *bio, int err) { @@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig) } EXPORT_SYMBOL(blk_queue_bounce); + +/* + * Fix the problem of pages getting dirtied after the integrity checksum + * calculation by copying all writes to a shadow buffer prior to computing + * checksums. + */ +static void __blk_queue_integrity_bounce(struct request_queue *q, + struct bio **bio_orig, + mempool_t *pool) +{ + struct page *page; + struct bio *bio = NULL; + int i; + struct bio_vec *to, *from; + + bio_for_each_segment(from, *bio_orig, i) { + char *vto, *vfrom; + page = from->bv_page; + + /* + * irk, bounce it + */ + if (!bio) { + unsigned int cnt = (*bio_orig)->bi_vcnt; + + bio = bio_alloc(GFP_NOIO, cnt); + memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec)); + } + + + to = bio->bi_io_vec + i; + + to->bv_page = mempool_alloc(pool, q->bounce_gfp); + to->bv_len = from->bv_len; + to->bv_offset = from->bv_offset; + inc_zone_page_state(to->bv_page, NR_BOUNCE); + + flush_dcache_page(from->bv_page); + vto = page_address(to->bv_page) + to->bv_offset; + vfrom = kmap(from->bv_page) + from->bv_offset; + memcpy(vto, vfrom, to->bv_len); + kunmap(from->bv_page); + } + + /* + * no pages bounced + */ + if (!bio) + return; + + trace_block_bio_bounce(q, *bio_orig); + + /* + * at least one page was bounced, fill in possible non-highmem + * pages + */ + __bio_for_each_segment(from, *bio_orig, i, 0) { + to = bio_iovec_idx(bio, i); + if (!to->bv_page) { + to->bv_page = from->bv_page; + to->bv_len = from->bv_len; + to->bv_offset = from->bv_offset; + } + } + + bio->bi_bdev = (*bio_orig)->bi_bdev; + bio->bi_flags |= (1 << BIO_BOUNCED); + bio->bi_sector = (*bio_orig)->bi_sector; + bio->bi_rw = (*bio_orig)->bi_rw; + + bio->bi_vcnt = (*bio_orig)->bi_vcnt; + bio->bi_idx = (*bio_orig)->bi_idx; + bio->bi_size = (*bio_orig)->bi_size; + + if (pool == integrity_pool) + bio->bi_end_io = bounce_end_integrity_io_write; + else + bio->bi_end_io = bounce_end_io_write_isa; + + bio->bi_private = *bio_orig; + *bio_orig = bio; +} + +void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig) +{ + mempool_t *pool; + + /* only do this for writes */ + if (bio_data_dir(*bio_orig) != WRITE) + return; + /* + * Data-less bio, nothing to bounce + */ + if (!bio_has_data(*bio_orig)) + return; + + if (!(q->bounce_gfp & GFP_DMA)) { + BUG_ON(!integrity_pool); + pool = integrity_pool; + } else { + BUG_ON(!isa_page_pool); + pool = isa_page_pool; + } + + /* + * slow path + */ + __blk_queue_integrity_bounce(q, bio_orig, pool); +} +EXPORT_SYMBOL(blk_queue_integrity_bounce); ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 2:00 [RFC] block integrity: Fix write after checksum calculation problem Darrick J. Wong @ 2011-02-22 5:45 ` Boaz Harrosh 2011-02-22 11:42 ` Jan Kara 2011-02-22 16:13 ` Andreas Dilger 2011-02-22 16:45 ` Martin K. Petersen 2 siblings, 1 reply; 67+ messages in thread From: Boaz Harrosh @ 2011-02-22 5:45 UTC (permalink / raw) To: djwong; +Cc: Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On 02/21/2011 06:00 PM, Darrick J. Wong wrote: > Hi all, > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a > discussion about how to deal with the situation where one program tells the > kernel to write a block to disk, the kernel computes the checksum of that data, > and then a second program begins writing to that same block before the disk HBA > can DMA the memory block, thereby causing the disk to complain about being sent > invalid checksums. The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions of ext4 it should work much better. (If you still have problems please report them, those FSs advertise stable pages write-out) This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write and syncing with write-out for example by taking the page-lock. Currently each FS is to itself because in VFS it would force the behaviour on FSs that it does not make sense to. Note that the proper solution does not copy any data, just forces the app to wait before changing write-out pages. Boaz > > By my recollection, several solutions were proposed, such as retrying the I/O > with a new checksum; using the VM to track and stop page writes during I/O; > creating shadow pages so that subsequent memory writes go to a different page; > punting the integrity failure to the app to let it deal with it; and only > allowing DIF mode when O_DIRECT is enabled. > > Unfortunately, I didn't get a sense that any sort of consensus had been reached > (much less anything patched into the kernel) and since I was able to write a > trivial program to trigger the write problem, I'm pretty sure that this has not > been fixed upstream. (FYI, using O_DIRECT still seems fine.) > > Below is a simple if naive solution: (ab)use the bounce buffering code to copy > the memory page just prior to calculating the checksum, and send the copy and > the checksum to the disk controller. With this patch applied, the invalid > guard tag messages go away. An optimization would be to perform the copy only > when memory contents change, but I wanted to ask peoples' opinions before > continuing. I don't imagine bounce buffering is particularly speedy, though I > haven't noticed any corruption errors or weirdness yet. > > Anyway, I'm mostly wondering: what do people think of this as a starting point > to fixing the DIF checksum problem? > > -- > block integrity: Fix write after integrity checksum calculation problem > > The kernel does not prohibit writes to a dirty page during a write IO. This > leads to the race where a memory write occurs after the integrity data has been > calculated but before the hardware initiates the DMA operation; when this > happens, the data does not match the checksum and the disk fails the write due > to an invalid checksum. > > An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot > a page's contents just prior to calculating the checksum, and sending the copy > and its checksum to the hardware. This is probably terrible for performance. > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > --- > > block/Kconfig | 1 > block/blk-core.c | 2 - > block/blk-integrity.c | 2 + > fs/bio-integrity.c | 12 +++- > include/linux/bio.h | 4 + > include/linux/blkdev.h | 12 ++++ > mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++ > 7 files changed, 154 insertions(+), 8 deletions(-) > > diff --git a/block/Kconfig b/block/Kconfig > index 60be1e0..ed89f19 100644 > --- a/block/Kconfig > +++ b/block/Kconfig > @@ -66,6 +66,7 @@ config BLK_DEV_BSG > If unsure, say Y. > > config BLK_DEV_INTEGRITY > + depends on BOUNCE=y > bool "Block layer data integrity support" > ---help--- > Some storage devices allow extra information to be > diff --git a/block/blk-core.c b/block/blk-core.c > index 3cc3bed..81a4e50 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio) > */ > blk_partition_remap(bio); > > - if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) > + if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio)) > goto end_io; > > if (old_sector != -1) > diff --git a/block/blk-integrity.c b/block/blk-integrity.c > index 54bcba6..24843f8 100644 > --- a/block/blk-integrity.c > +++ b/block/blk-integrity.c > @@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template) > > BUG_ON(disk == NULL); > > + init_integrity_pool(); > + > if (disk->integrity == NULL) { > bi = kmem_cache_alloc(integrity_cachep, > GFP_KERNEL | __GFP_ZERO); > diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c > index df4fdfd..9eee904 100644 > --- a/fs/bio-integrity.c > +++ b/fs/bio-integrity.c > @@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi) > * block device's integrity function. In the READ case, the buffer > * will be prepared for DMA and a suitable end_io handler set up. > */ > -int bio_integrity_prep(struct bio *bio) > +int bio_integrity_prep(struct bio **pbio) > { > + struct bio *bio; > struct bio_integrity_payload *bip; > struct blk_integrity *bi; > struct request_queue *q; > @@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio) > unsigned int bytes, offset, i; > unsigned int sectors; > > - bi = bdev_get_integrity(bio->bi_bdev); > - q = bdev_get_queue(bio->bi_bdev); > + bi = bdev_get_integrity((*pbio)->bi_bdev); > + q = bdev_get_queue((*pbio)->bi_bdev); > BUG_ON(bi == NULL); > - BUG_ON(bio_integrity(bio)); > + BUG_ON(bio_integrity(*pbio)); > + > + blk_queue_integrity_bounce(q, pbio); > + bio = *pbio; > > sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio)); > > diff --git a/include/linux/bio.h b/include/linux/bio.h > index 36d10f0..1834934 100644 > --- a/include/linux/bio.h > +++ b/include/linux/bio.h > @@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl) > for_each_bio(_bio) \ > bip_for_each_vec(_bvl, _bio->bi_integrity, _iter) > > -#define bio_integrity(bio) (bio->bi_integrity != NULL) > +#define bio_integrity(bio) ((bio)->bi_integrity != NULL) > > extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *); > extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int); > @@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns > extern int bio_integrity_enabled(struct bio *bio); > extern int bio_integrity_set_tag(struct bio *, void *, unsigned int); > extern int bio_integrity_get_tag(struct bio *, void *, unsigned int); > -extern int bio_integrity_prep(struct bio *); > +extern int bio_integrity_prep(struct bio **); > extern void bio_integrity_endio(struct bio *, int); > extern void bio_integrity_advance(struct bio *, unsigned int); > extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int); > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 8a082a5..1c9fc40 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn; > #define BLK_DEFAULT_SG_TIMEOUT (60 * HZ) > #define BLK_MIN_SG_TIMEOUT (7 * HZ) > > +#ifdef CONFIG_BLK_DEV_INTEGRITY > +extern int init_integrity_pool(void); > +extern void blk_queue_integrity_bounce(struct request_queue *q, > + struct bio **bio); > +#else > +#define init_integrity_pool() do { } while (0); > +static inline void blk_queue_integrity_bounce(struct request_queue *q, > + struct bio **bio) > +{ > +} > +#endif > + > #ifdef CONFIG_BOUNCE > extern int init_emergency_isa_pool(void); > extern void blk_queue_bounce(struct request_queue *q, struct bio **bio); > diff --git a/mm/bounce.c b/mm/bounce.c > index 1481de6..c0ce533 100644 > --- a/mm/bounce.c > +++ b/mm/bounce.c > @@ -21,7 +21,19 @@ > #define POOL_SIZE 64 > #define ISA_POOL_SIZE 16 > > -static mempool_t *page_pool, *isa_page_pool; > +static mempool_t *integrity_pool, *page_pool, *isa_page_pool; > + > +int init_integrity_pool(void) > +{ > + if (integrity_pool) > + return 0; > + > + integrity_pool = mempool_create_page_pool(POOL_SIZE, 0); > + BUG_ON(!integrity_pool); > + printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE); > + > + return 0; > +} > > #ifdef CONFIG_HIGHMEM > static __init int init_emergency_pool(void) > @@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err) > bounce_end_io(bio, page_pool, err); > } > > +static void bounce_end_integrity_io_write(struct bio *bio, int err) > +{ > + bounce_end_io(bio, integrity_pool, err); > +} > + > static void bounce_end_io_write_isa(struct bio *bio, int err) > { > > @@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig) > } > > EXPORT_SYMBOL(blk_queue_bounce); > + > +/* > + * Fix the problem of pages getting dirtied after the integrity checksum > + * calculation by copying all writes to a shadow buffer prior to computing > + * checksums. > + */ > +static void __blk_queue_integrity_bounce(struct request_queue *q, > + struct bio **bio_orig, > + mempool_t *pool) > +{ > + struct page *page; > + struct bio *bio = NULL; > + int i; > + struct bio_vec *to, *from; > + > + bio_for_each_segment(from, *bio_orig, i) { > + char *vto, *vfrom; > + page = from->bv_page; > + > + /* > + * irk, bounce it > + */ > + if (!bio) { > + unsigned int cnt = (*bio_orig)->bi_vcnt; > + > + bio = bio_alloc(GFP_NOIO, cnt); > + memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec)); > + } > + > + > + to = bio->bi_io_vec + i; > + > + to->bv_page = mempool_alloc(pool, q->bounce_gfp); > + to->bv_len = from->bv_len; > + to->bv_offset = from->bv_offset; > + inc_zone_page_state(to->bv_page, NR_BOUNCE); > + > + flush_dcache_page(from->bv_page); > + vto = page_address(to->bv_page) + to->bv_offset; > + vfrom = kmap(from->bv_page) + from->bv_offset; > + memcpy(vto, vfrom, to->bv_len); > + kunmap(from->bv_page); > + } > + > + /* > + * no pages bounced > + */ > + if (!bio) > + return; > + > + trace_block_bio_bounce(q, *bio_orig); > + > + /* > + * at least one page was bounced, fill in possible non-highmem > + * pages > + */ > + __bio_for_each_segment(from, *bio_orig, i, 0) { > + to = bio_iovec_idx(bio, i); > + if (!to->bv_page) { > + to->bv_page = from->bv_page; > + to->bv_len = from->bv_len; > + to->bv_offset = from->bv_offset; > + } > + } > + > + bio->bi_bdev = (*bio_orig)->bi_bdev; > + bio->bi_flags |= (1 << BIO_BOUNCED); > + bio->bi_sector = (*bio_orig)->bi_sector; > + bio->bi_rw = (*bio_orig)->bi_rw; > + > + bio->bi_vcnt = (*bio_orig)->bi_vcnt; > + bio->bi_idx = (*bio_orig)->bi_idx; > + bio->bi_size = (*bio_orig)->bi_size; > + > + if (pool == integrity_pool) > + bio->bi_end_io = bounce_end_integrity_io_write; > + else > + bio->bi_end_io = bounce_end_io_write_isa; > + > + bio->bi_private = *bio_orig; > + *bio_orig = bio; > +} > + > +void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig) > +{ > + mempool_t *pool; > + > + /* only do this for writes */ > + if (bio_data_dir(*bio_orig) != WRITE) > + return; > + /* > + * Data-less bio, nothing to bounce > + */ > + if (!bio_has_data(*bio_orig)) > + return; > + > + if (!(q->bounce_gfp & GFP_DMA)) { > + BUG_ON(!integrity_pool); > + pool = integrity_pool; > + } else { > + BUG_ON(!isa_page_pool); > + pool = isa_page_pool; > + } > + > + /* > + * slow path > + */ > + __blk_queue_integrity_bounce(q, bio_orig, pool); > +} > +EXPORT_SYMBOL(blk_queue_integrity_bounce); > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 5:45 ` Boaz Harrosh @ 2011-02-22 11:42 ` Jan Kara 2011-02-22 13:02 ` Chris Mason 2011-03-04 20:51 ` Darrick J. Wong 0 siblings, 2 replies; 67+ messages in thread From: Jan Kara @ 2011-02-22 11:42 UTC (permalink / raw) To: Boaz Harrosh Cc: djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Hi Boaz, On Mon 21-02-11 21:45:51, Boaz Harrosh wrote: > On 02/21/2011 06:00 PM, Darrick J. Wong wrote: > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a > > discussion about how to deal with the situation where one program tells the > > kernel to write a block to disk, the kernel computes the checksum of that data, > > and then a second program begins writing to that same block before the disk HBA > > can DMA the memory block, thereby causing the disk to complain about being sent > > invalid checksums. > > The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions > of ext4 it should work much better. (If you still have problems please report > them, those FSs advertise stable pages write-out) Do they? I've just checked ext4 and xfs and they don't seem to enforce stable pages. They do lock the page (which implicitely happens in mm code for any filesystem BTW) but this is not enough. You have to wait for PageWriteback to get cleared and only btrfs does that. > This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write > and syncing with write-out for example by taking the page-lock. Currently each > FS is to itself because in VFS it would force the behaviour on FSs that it does > not make sense to. Yes, it's easy to fix but at a performance cost for any application doing frequent rewrites regardless whether integrity features are used or not. And I don't think that's a good thing. I even remember someone measured the hit last time this came up and it was rather noticeable. > Note that the proper solution does not copy any data, just forces the app to > wait before changing write-out pages. I think that's up for discussion. In fact what is going to be faster depends pretty much on your system config. If you have enough CPU/RAM bandwidth compared to storage speed, you're better of doing copying. If you can barely saturate storage with your CPU/RAM, waiting is probably better for you. Moreover if you do data copyout, you push the performance cost only on users of the integrity feature which is nice. But on the other hand users of integrity take the cost even if they are not doing rewrites. A solution which is technically plausible and penalizing only rewrites of data-integrity protected pages would be a use of shadow pages as Darrick describes below. So I'd lean towards that long term. But for now I think Darrick's solution is OK to make the integrity feature actually useful and later someone can try something more clever. Honza > > By my recollection, several solutions were proposed, such as retrying the I/O > > with a new checksum; using the VM to track and stop page writes during I/O; > > creating shadow pages so that subsequent memory writes go to a different page; > > punting the integrity failure to the app to let it deal with it; and only > > allowing DIF mode when O_DIRECT is enabled. > > > > Unfortunately, I didn't get a sense that any sort of consensus had been reached > > (much less anything patched into the kernel) and since I was able to write a > > trivial program to trigger the write problem, I'm pretty sure that this has not > > been fixed upstream. (FYI, using O_DIRECT still seems fine.) > > > > Below is a simple if naive solution: (ab)use the bounce buffering code to copy > > the memory page just prior to calculating the checksum, and send the copy and > > the checksum to the disk controller. With this patch applied, the invalid > > guard tag messages go away. An optimization would be to perform the copy only > > when memory contents change, but I wanted to ask peoples' opinions before > > continuing. I don't imagine bounce buffering is particularly speedy, though I > > haven't noticed any corruption errors or weirdness yet. > > > > Anyway, I'm mostly wondering: what do people think of this as a starting point > > to fixing the DIF checksum problem? > > > > -- > > block integrity: Fix write after integrity checksum calculation problem > > > > The kernel does not prohibit writes to a dirty page during a write IO. This > > leads to the race where a memory write occurs after the integrity data has been > > calculated but before the hardware initiates the DMA operation; when this > > happens, the data does not match the checksum and the disk fails the write due > > to an invalid checksum. > > > > An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot > > a page's contents just prior to calculating the checksum, and sending the copy > > and its checksum to the hardware. This is probably terrible for performance. > > > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > > --- > > > > block/Kconfig | 1 > > block/blk-core.c | 2 - > > block/blk-integrity.c | 2 + > > fs/bio-integrity.c | 12 +++- > > include/linux/bio.h | 4 + > > include/linux/blkdev.h | 12 ++++ > > mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++ > > 7 files changed, 154 insertions(+), 8 deletions(-) > > > > diff --git a/block/Kconfig b/block/Kconfig > > index 60be1e0..ed89f19 100644 > > --- a/block/Kconfig > > +++ b/block/Kconfig > > @@ -66,6 +66,7 @@ config BLK_DEV_BSG > > If unsure, say Y. > > > > config BLK_DEV_INTEGRITY > > + depends on BOUNCE=y > > bool "Block layer data integrity support" > > ---help--- > > Some storage devices allow extra information to be > > diff --git a/block/blk-core.c b/block/blk-core.c > > index 3cc3bed..81a4e50 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio) > > */ > > blk_partition_remap(bio); > > > > - if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) > > + if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio)) > > goto end_io; > > > > if (old_sector != -1) > > diff --git a/block/blk-integrity.c b/block/blk-integrity.c > > index 54bcba6..24843f8 100644 > > --- a/block/blk-integrity.c > > +++ b/block/blk-integrity.c > > @@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template) > > > > BUG_ON(disk == NULL); > > > > + init_integrity_pool(); > > + > > if (disk->integrity == NULL) { > > bi = kmem_cache_alloc(integrity_cachep, > > GFP_KERNEL | __GFP_ZERO); > > diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c > > index df4fdfd..9eee904 100644 > > --- a/fs/bio-integrity.c > > +++ b/fs/bio-integrity.c > > @@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi) > > * block device's integrity function. In the READ case, the buffer > > * will be prepared for DMA and a suitable end_io handler set up. > > */ > > -int bio_integrity_prep(struct bio *bio) > > +int bio_integrity_prep(struct bio **pbio) > > { > > + struct bio *bio; > > struct bio_integrity_payload *bip; > > struct blk_integrity *bi; > > struct request_queue *q; > > @@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio) > > unsigned int bytes, offset, i; > > unsigned int sectors; > > > > - bi = bdev_get_integrity(bio->bi_bdev); > > - q = bdev_get_queue(bio->bi_bdev); > > + bi = bdev_get_integrity((*pbio)->bi_bdev); > > + q = bdev_get_queue((*pbio)->bi_bdev); > > BUG_ON(bi == NULL); > > - BUG_ON(bio_integrity(bio)); > > + BUG_ON(bio_integrity(*pbio)); > > + > > + blk_queue_integrity_bounce(q, pbio); > > + bio = *pbio; > > > > sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio)); > > > > diff --git a/include/linux/bio.h b/include/linux/bio.h > > index 36d10f0..1834934 100644 > > --- a/include/linux/bio.h > > +++ b/include/linux/bio.h > > @@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl) > > for_each_bio(_bio) \ > > bip_for_each_vec(_bvl, _bio->bi_integrity, _iter) > > > > -#define bio_integrity(bio) (bio->bi_integrity != NULL) > > +#define bio_integrity(bio) ((bio)->bi_integrity != NULL) > > > > extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *); > > extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int); > > @@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns > > extern int bio_integrity_enabled(struct bio *bio); > > extern int bio_integrity_set_tag(struct bio *, void *, unsigned int); > > extern int bio_integrity_get_tag(struct bio *, void *, unsigned int); > > -extern int bio_integrity_prep(struct bio *); > > +extern int bio_integrity_prep(struct bio **); > > extern void bio_integrity_endio(struct bio *, int); > > extern void bio_integrity_advance(struct bio *, unsigned int); > > extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int); > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > > index 8a082a5..1c9fc40 100644 > > --- a/include/linux/blkdev.h > > +++ b/include/linux/blkdev.h > > @@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn; > > #define BLK_DEFAULT_SG_TIMEOUT (60 * HZ) > > #define BLK_MIN_SG_TIMEOUT (7 * HZ) > > > > +#ifdef CONFIG_BLK_DEV_INTEGRITY > > +extern int init_integrity_pool(void); > > +extern void blk_queue_integrity_bounce(struct request_queue *q, > > + struct bio **bio); > > +#else > > +#define init_integrity_pool() do { } while (0); > > +static inline void blk_queue_integrity_bounce(struct request_queue *q, > > + struct bio **bio) > > +{ > > +} > > +#endif > > + > > #ifdef CONFIG_BOUNCE > > extern int init_emergency_isa_pool(void); > > extern void blk_queue_bounce(struct request_queue *q, struct bio **bio); > > diff --git a/mm/bounce.c b/mm/bounce.c > > index 1481de6..c0ce533 100644 > > --- a/mm/bounce.c > > +++ b/mm/bounce.c > > @@ -21,7 +21,19 @@ > > #define POOL_SIZE 64 > > #define ISA_POOL_SIZE 16 > > > > -static mempool_t *page_pool, *isa_page_pool; > > +static mempool_t *integrity_pool, *page_pool, *isa_page_pool; > > + > > +int init_integrity_pool(void) > > +{ > > + if (integrity_pool) > > + return 0; > > + > > + integrity_pool = mempool_create_page_pool(POOL_SIZE, 0); > > + BUG_ON(!integrity_pool); > > + printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE); > > + > > + return 0; > > +} > > > > #ifdef CONFIG_HIGHMEM > > static __init int init_emergency_pool(void) > > @@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err) > > bounce_end_io(bio, page_pool, err); > > } > > > > +static void bounce_end_integrity_io_write(struct bio *bio, int err) > > +{ > > + bounce_end_io(bio, integrity_pool, err); > > +} > > + > > static void bounce_end_io_write_isa(struct bio *bio, int err) > > { > > > > @@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig) > > } > > > > EXPORT_SYMBOL(blk_queue_bounce); > > + > > +/* > > + * Fix the problem of pages getting dirtied after the integrity checksum > > + * calculation by copying all writes to a shadow buffer prior to computing > > + * checksums. > > + */ > > +static void __blk_queue_integrity_bounce(struct request_queue *q, > > + struct bio **bio_orig, > > + mempool_t *pool) > > +{ > > + struct page *page; > > + struct bio *bio = NULL; > > + int i; > > + struct bio_vec *to, *from; > > + > > + bio_for_each_segment(from, *bio_orig, i) { > > + char *vto, *vfrom; > > + page = from->bv_page; > > + > > + /* > > + * irk, bounce it > > + */ > > + if (!bio) { > > + unsigned int cnt = (*bio_orig)->bi_vcnt; > > + > > + bio = bio_alloc(GFP_NOIO, cnt); > > + memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec)); > > + } > > + > > + > > + to = bio->bi_io_vec + i; > > + > > + to->bv_page = mempool_alloc(pool, q->bounce_gfp); > > + to->bv_len = from->bv_len; > > + to->bv_offset = from->bv_offset; > > + inc_zone_page_state(to->bv_page, NR_BOUNCE); > > + > > + flush_dcache_page(from->bv_page); > > + vto = page_address(to->bv_page) + to->bv_offset; > > + vfrom = kmap(from->bv_page) + from->bv_offset; > > + memcpy(vto, vfrom, to->bv_len); > > + kunmap(from->bv_page); > > + } > > + > > + /* > > + * no pages bounced > > + */ > > + if (!bio) > > + return; > > + > > + trace_block_bio_bounce(q, *bio_orig); > > + > > + /* > > + * at least one page was bounced, fill in possible non-highmem > > + * pages > > + */ > > + __bio_for_each_segment(from, *bio_orig, i, 0) { > > + to = bio_iovec_idx(bio, i); > > + if (!to->bv_page) { > > + to->bv_page = from->bv_page; > > + to->bv_len = from->bv_len; > > + to->bv_offset = from->bv_offset; > > + } > > + } > > + > > + bio->bi_bdev = (*bio_orig)->bi_bdev; > > + bio->bi_flags |= (1 << BIO_BOUNCED); > > + bio->bi_sector = (*bio_orig)->bi_sector; > > + bio->bi_rw = (*bio_orig)->bi_rw; > > + > > + bio->bi_vcnt = (*bio_orig)->bi_vcnt; > > + bio->bi_idx = (*bio_orig)->bi_idx; > > + bio->bi_size = (*bio_orig)->bi_size; > > + > > + if (pool == integrity_pool) > > + bio->bi_end_io = bounce_end_integrity_io_write; > > + else > > + bio->bi_end_io = bounce_end_io_write_isa; > > + > > + bio->bi_private = *bio_orig; > > + *bio_orig = bio; > > +} > > + > > +void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig) > > +{ > > + mempool_t *pool; > > + > > + /* only do this for writes */ > > + if (bio_data_dir(*bio_orig) != WRITE) > > + return; > > + /* > > + * Data-less bio, nothing to bounce > > + */ > > + if (!bio_has_data(*bio_orig)) > > + return; > > + > > + if (!(q->bounce_gfp & GFP_DMA)) { > > + BUG_ON(!integrity_pool); > > + pool = integrity_pool; > > + } else { > > + BUG_ON(!isa_page_pool); > > + pool = isa_page_pool; > > + } > > + > > + /* > > + * slow path > > + */ > > + __blk_queue_integrity_bounce(q, bio_orig, pool); > > +} > > +EXPORT_SYMBOL(blk_queue_integrity_bounce); > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 11:42 ` Jan Kara @ 2011-02-22 13:02 ` Chris Mason 2011-02-22 19:13 ` Boaz Harrosh 2011-03-04 20:51 ` Darrick J. Wong 1 sibling, 1 reply; 67+ messages in thread From: Chris Mason @ 2011-02-22 13:02 UTC (permalink / raw) To: Jan Kara Cc: Boaz Harrosh, djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi [ resend sorry if you get this twice ] Excerpts from Jan Kara's message of 2011-02-22 06:42:22 -0500: > Hi Boaz, > > On Mon 21-02-11 21:45:51, Boaz Harrosh wrote: > > On 02/21/2011 06:00 PM, Darrick J. Wong wrote: > > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 > > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a > > > discussion about how to deal with the situation where one program tells the > > > kernel to write a block to disk, the kernel computes the checksum of that data, > > > and then a second program begins writing to that same block before the disk HBA > > > can DMA the memory block, thereby causing the disk to complain about being sent > > > invalid checksums. > > > > The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions > > of ext4 it should work much better. (If you still have problems please report > > them, those FSs advertise stable pages write-out) > Do they? I've just checked ext4 and xfs and they don't seem to enforce > stable pages. They do lock the page (which implicitely happens in mm code > for any filesystem BTW) but this is not enough. You have to wait for > PageWriteback to get cleared and only btrfs does that. > > > This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write > > and syncing with write-out for example by taking the page-lock. Currently each > > FS is to itself because in VFS it would force the behaviour on FSs that it does > > not make sense to. > Yes, it's easy to fix but at a performance cost for any application doing > frequent rewrites regardless whether integrity features are used or not. > And I don't think that's a good thing. I even remember someone measured the > hit last time this came up and it was rather noticeable. Do you remember which workload this was? I do remember someone mentioning a specific workload, but can't recall which one now. fsx is definitely slower when we wait for writeback, but that's because it's all evil inside. > > > Note that the proper solution does not copy any data, just forces the app to > > wait before changing write-out pages. > I think that's up for discussion. In fact what is going to be faster > depends pretty much on your system config. If you have enough CPU/RAM > bandwidth compared to storage speed, you're better of doing copying. If > you can barely saturate storage with your CPU/RAM, waiting is probably > better for you. > > Moreover if you do data copyout, you push the performance cost only on > users of the integrity feature which is nice. But on the other hand users > of integrity take the cost even if they are not doing rewrites. > > A solution which is technically plausible and penalizing only rewrites > of data-integrity protected pages would be a use of shadow pages as Darrick > describes below. So I'd lean towards that long term. But for now I think > Darrick's solution is OK to make the integrity feature actually useful and > later someone can try something more clever. Rewrites in flight should be very rare though, and I think the bouncing is going to have a big impact on the intended workloads. It's not just the cost of the copy, it's also the increased time as we beat on the page allocator. We're working on adding stable pages to ext34 and other filesystems missing it. When the work is done we can benchmark and decide on the tradeoffs. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 13:02 ` Chris Mason @ 2011-02-22 19:13 ` Boaz Harrosh 0 siblings, 0 replies; 67+ messages in thread From: Boaz Harrosh @ 2011-02-22 19:13 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On 02/22/2011 05:02 AM, Chris Mason wrote: > [ resend sorry if you get this twice ] > > Excerpts from Jan Kara's message of 2011-02-22 06:42:22 -0500: >> Hi Boaz, >> >> On Mon 21-02-11 21:45:51, Boaz Harrosh wrote: >>> On 02/21/2011 06:00 PM, Darrick J. Wong wrote: >>>> Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 >>>> write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a >>>> discussion about how to deal with the situation where one program tells the >>>> kernel to write a block to disk, the kernel computes the checksum of that data, >>>> and then a second program begins writing to that same block before the disk HBA >>>> can DMA the memory block, thereby causing the disk to complain about being sent >>>> invalid checksums. >>> >>> The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions >>> of ext4 it should work much better. (If you still have problems please report >>> them, those FSs advertise stable pages write-out) >> Do they? I've just checked ext4 and xfs and they don't seem to enforce >> stable pages. They do lock the page (which implicitely happens in mm code >> for any filesystem BTW) but this is not enough. You have to wait for >> PageWriteback to get cleared and only btrfs does that. >> >>> This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write >>> and syncing with write-out for example by taking the page-lock. Currently each >>> FS is to itself because in VFS it would force the behaviour on FSs that it does >>> not make sense to. >> Yes, it's easy to fix but at a performance cost for any application doing >> frequent rewrites regardless whether integrity features are used or not. >> And I don't think that's a good thing. I even remember someone measured the >> hit last time this came up and it was rather noticeable. > > Do you remember which workload this was? I do remember someone > mentioning a specific workload, but can't recall which one now. fsx is > definitely slower when we wait for writeback, but that's because it's > all evil inside. > Me too I have been asking on many occasions on multiple mailing lists and LSF last year, if any one did any benchmarks on this issue, and no one came forward. So please if someone is hiding his resaults come and show us we want to see? I've been Playing with this for my raid5 code and Actually the problem is not that bad. I'm using tar to exercise that. What I can see that I can actually wait on a single page every once in like 15-30 seconds but never consecutively on multiple pages. Because usually you wait on an IOed page but all the pages in an IO are completed together and the reset of the modifications go through. I could not find any measurable performance difference, it was all well inside the margin of the test. But maybe tar is just bad test. >> >>> Note that the proper solution does not copy any data, just forces the app to >>> wait before changing write-out pages. >> I think that's up for discussion. In fact what is going to be faster >> depends pretty much on your system config. If you have enough CPU/RAM >> bandwidth compared to storage speed, you're better of doing copying. If >> you can barely saturate storage with your CPU/RAM, waiting is probably >> better for you. >> >> Moreover if you do data copyout, you push the performance cost only on >> users of the integrity feature which is nice. But on the other hand users >> of integrity take the cost even if they are not doing rewrites. >> >> A solution which is technically plausible and penalizing only rewrites >> of data-integrity protected pages would be a use of shadow pages as Darrick >> describes below. So I'd lean towards that long term. But for now I think >> Darrick's solution is OK to make the integrity feature actually useful and >> later someone can try something more clever. > > Rewrites in flight should be very rare though, and I think the bouncing > is going to have a big impact on the intended workloads. It's not just > the cost of the copy, it's also the increased time as we beat on the > page allocator. > > We're working on adding stable pages to ext34 and other filesystems > missing it. When the work is done we can benchmark and decide on the > tradeoffs. > Thanks, that would be nice. I bet the simple wait on write-out will beat copy-everything , any day of the week. Why don't we put some flags on the BDI that requests stable write-out and devices with diff enabled or the likes of raid1/4/5/6 can turn it on? (Also iscsi integrity checks as well) > -chris Boaz ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 11:42 ` Jan Kara 2011-02-22 13:02 ` Chris Mason @ 2011-03-04 20:51 ` Darrick J. Wong 2011-03-04 20:53 ` Christoph Hellwig 1 sibling, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-03-04 20:51 UTC (permalink / raw) To: Jan Kara Cc: Boaz Harrosh, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Tue, Feb 22, 2011 at 12:42:22PM +0100, Jan Kara wrote: > Hi Boaz, > > On Mon 21-02-11 21:45:51, Boaz Harrosh wrote: > > On 02/21/2011 06:00 PM, Darrick J. Wong wrote: > > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 > > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a > > > discussion about how to deal with the situation where one program tells the > > > kernel to write a block to disk, the kernel computes the checksum of that data, > > > and then a second program begins writing to that same block before the disk HBA > > > can DMA the memory block, thereby causing the disk to complain about being sent > > > invalid checksums. > > > > The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions > > of ext4 it should work much better. (If you still have problems please report > > them, those FSs advertise stable pages write-out) > Do they? I've just checked ext4 and xfs and they don't seem to enforce > stable pages. They do lock the page (which implicitely happens in mm code > for any filesystem BTW) but this is not enough. You have to wait for > PageWriteback to get cleared and only btrfs does that. > > > This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write > > and syncing with write-out for example by taking the page-lock. Currently each > > FS is to itself because in VFS it would force the behaviour on FSs that it does > > not make sense to. > Yes, it's easy to fix but at a performance cost for any application doing > frequent rewrites regardless whether integrity features are used or not. > And I don't think that's a good thing. I even remember someone measured the > hit last time this came up and it was rather noticeable. > > > Note that the proper solution does not copy any data, just forces the app to > > wait before changing write-out pages. > I think that's up for discussion. In fact what is going to be faster > depends pretty much on your system config. If you have enough CPU/RAM > bandwidth compared to storage speed, you're better of doing copying. If > you can barely saturate storage with your CPU/RAM, waiting is probably > better for you. > > Moreover if you do data copyout, you push the performance cost only on > users of the integrity feature which is nice. But on the other hand users > of integrity take the cost even if they are not doing rewrites. > > A solution which is technically plausible and penalizing only rewrites > of data-integrity protected pages would be a use of shadow pages as Darrick > describes below. So I'd lean towards that long term. But for now I think > Darrick's solution is OK to make the integrity feature actually useful and > later someone can try something more clever. Hmm. Any interest in pushing the page copy patch as an interim solution while I work on getting the wait-on-writeback strategy to function? I agree it's not the fastest solution, but at least it won't be running broken while I find the faster solution(s). (More on that writeback patch in a short while.) --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-04 20:51 ` Darrick J. Wong @ 2011-03-04 20:53 ` Christoph Hellwig 0 siblings, 0 replies; 67+ messages in thread From: Christoph Hellwig @ 2011-03-04 20:53 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, Boaz Harrosh, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Fri, Mar 04, 2011 at 12:51:43PM -0800, Darrick J. Wong wrote: > Hmm. Any interest in pushing the page copy patch as an interim solution while > I work on getting the wait-on-writeback strategy to function? I agree it's not > the fastest solution, but at least it won't be running broken while I find the > faster solution(s). It's not only slow but also butt ugly. NAK from me - we need to get this completely right, not hack around it. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 2:00 [RFC] block integrity: Fix write after checksum calculation problem Darrick J. Wong 2011-02-22 5:45 ` Boaz Harrosh @ 2011-02-22 16:13 ` Andreas Dilger 2011-02-22 16:40 ` Martin K. Petersen ` (2 more replies) 2011-02-22 16:45 ` Martin K. Petersen 2 siblings, 3 replies; 67+ messages in thread From: Andreas Dilger @ 2011-02-22 16:13 UTC (permalink / raw) To: djwong; +Cc: Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On 2011-02-21, at 19:00, "Darrick J. Wong" <djwong@us.ibm.com> wrote: > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a > discussion about how to deal with the situation where one program tells the > kernel to write a block to disk, the kernel computes the checksum of that data, > and then a second program begins writing to that same block before the disk HBA > can DMA the memory block, thereby causing the disk to complain about being sent > invalid checksums. > > I was able to write a > trivial program to trigger the write problem, I'm pretty sure that this has not > been fixed upstream. (FYI, using O_DIRECT still seems fine.) Can you please attach your reproducer? IIRC it needed mmap() to hit this problem? Did you measure CPU usage during your testing? > Below is a simple if naive solution: (ab)use the bounce buffering code to copy > the memory page just prior to calculating the checksum, and send the copy and > the checksum to the disk controller. With this patch applied, the invalid > guard tag messages go away. An optimization would be to perform the copy only > when memory contents change, but I wanted to ask peoples' opinions before > continuing. I don't imagine bounce buffering is particularly speedy, though I > haven't noticed any corruption errors or weirdness yet. I don't like adding a data copy in the IO path at all. We are just looking to enable T10 DIF for Lustre and this would definitely hurt performance significantly, even though it isn't needed there at all (since the server side has proper locking of the pages to prevent multiple writers to the same page). > Anyway, I'm mostly wondering: what do people think of this as a starting point > to fixing the DIF checksum problem? I'd definitely prefer that the filesystem be in charge of deciding whether this is needed or not. If the use of the data copy can be constrained to only the minimum required cases (e.g. if fs checks for rewrite on page that is marked as Writeback and either copies or blocks until writeback is complete, as a mount option) that would be better. At that point we can compare on different hardware whether copying or blocking should be the default. Cheers, Andreas ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 16:13 ` Andreas Dilger @ 2011-02-22 16:40 ` Martin K. Petersen 2011-02-22 19:45 ` Darrick J. Wong 2011-02-28 8:49 ` Christoph Hellwig 2 siblings, 0 replies; 67+ messages in thread From: Martin K. Petersen @ 2011-02-22 16:40 UTC (permalink / raw) To: Andreas Dilger Cc: djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi >>>>> "Andreas" == Andreas Dilger <adilger@dilger.ca> writes: Andreas> I don't like adding a data copy in the IO path at all. No thanks! Andreas> I'd definitely prefer that the filesystem be in charge of Andreas> deciding whether this is needed or not. Absolutely. At the block layer we obviously have no idea whether the filesystem is safe or not. So in my current tree the protection is only generated if the relevant bio flag is set (unless the application already added it, obviously). Andreas> If the use of the data copy can be constrained to only the Andreas> minimum required cases (e.g. if fs checks for rewrite on page Andreas> that is marked as Writeback and either copies or blocks until Andreas> writeback is complete, as a mount option) that would be Andreas> better. At that point we can compare on different hardware Andreas> whether copying or blocking should be the default. Agreed. As Chris mentioned we've got somebody on our team working through the filesystem issues now. I'm hoping we can provide a status update at LSF2011. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 16:13 ` Andreas Dilger 2011-02-22 16:40 ` Martin K. Petersen @ 2011-02-22 19:45 ` Darrick J. Wong 2011-02-22 22:53 ` Dave Chinner 2011-02-28 8:49 ` Christoph Hellwig 2 siblings, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-02-22 19:45 UTC (permalink / raw) To: Andreas Dilger Cc: Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Tue, Feb 22, 2011 at 09:13:49AM -0700, Andreas Dilger wrote: > On 2011-02-21, at 19:00, "Darrick J. Wong" <djwong@us.ibm.com> wrote: > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a > > discussion about how to deal with the situation where one program tells the > > kernel to write a block to disk, the kernel computes the checksum of that data, > > and then a second program begins writing to that same block before the disk HBA > > can DMA the memory block, thereby causing the disk to complain about being sent > > invalid checksums. > > > > I was able to write a > > trivial program to trigger the write problem, I'm pretty sure that this has not > > been fixed upstream. (FYI, using O_DIRECT still seems fine.) > > Can you please attach your reproducer? IIRC it needed mmap() to hit this > problem? Did you measure CPU usage during your testing? I didn't need mmap; a lot of threads using write() was enough. (The reproducer program does have a mmap mode though). Basically it creates a lot of threads to write small blobs to random offsets in a file, with optional mmap, dio, and sync options. CPU usage was 100% the entire time since there were 64 threads running on 4 CPUs. With just write() mode about half that was userspace and the other half was kernel. With write and mmap the balance became closer to 80/20. The program is attached below. It can be built with a simple "cc -o wac wac.c". The invocation that I was using to produce the errors was: ./wac -l65536 -n64 -r /mnt/junk This creates a file that is 64K (16 pages) long and starts 64 threads that will write() a small buffer and then call sync_file_range to force IO to happen. With that I get a steady stream of DIF checksum errors on the console. If -r is omitted they happen about once every commit= seconds. > > Below is a simple if naive solution: (ab)use the bounce buffering code to copy > > the memory page just prior to calculating the checksum, and send the copy and > > the checksum to the disk controller. With this patch applied, the invalid > > guard tag messages go away. An optimization would be to perform the copy only > > when memory contents change, but I wanted to ask peoples' opinions before > > continuing. I don't imagine bounce buffering is particularly speedy, though I > > haven't noticed any corruption errors or weirdness yet. > > I don't like adding a data copy in the IO path at all. We are just looking to > enable T10 DIF for Lustre and this would definitely hurt performance > significantly, even though it isn't needed there at all (since the server > side has proper locking of the pages to prevent multiple writers to the same > page). > > > Anyway, I'm mostly wondering: what do people think of this as a starting point > > to fixing the DIF checksum problem? > > I'd definitely prefer that the filesystem be in charge of deciding whether > this is needed or not. If the use of the data copy can be constrained to only > the minimum required cases (e.g. if fs checks for rewrite on page that is > marked as Writeback and either copies or blocks until writeback is complete, > as a mount option) that would be better. At that point we can compare on > different hardware whether copying or blocking should be the default. Agreed. I too am curious to study which circumstances favor copying vs blocking. > Cheers, Andreas > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html /* * Write-After-Checksum reproducer(?) program * Copyright (C) 2011 IBM. All rights reserved. * This program is licensed under the GPLv2. */ #define _XOPEN_SOURCE 600 #define _FILE_OFFSET_BITS 64 #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/wait.h> #include <sys/mman.h> #include <string.h> #define __USE_GNU #include <fcntl.h> #include <errno.h> #include <inttypes.h> #define SYNC_RANGE 1 #define SYNC_FILE 2 #define DEFAULT_BUFSIZE 4096 static uint32_t bufsize = DEFAULT_BUFSIZE; void help(const char *pname) { printf("Usage: %s [-n threads] [-m threads] [-d] [-b blocksize] [-r] [-f] -l filesize filename\n", pname); printf("-b Size of a memory page.\n"); printf("-d Use direct I/O.\n"); printf("-l Desired file size.\n"); printf("-n Use this many write() threads.\n"); printf("-m Use this many mmap write threads.\n"); printf("-s Synchronous disk writes.\n"); printf("-r Use sync_file_range after write.\n"); printf("-f fsync after write.\n"); } int seed_random(void) { int fp; long seed; fp = open("/dev/urandom", O_RDONLY); if (fp < 0) { perror("/dev/urandom"); return 0; } if (read(fp, &seed, sizeof(seed)) != sizeof(seed)) { perror("read random seed"); return 0; } close(fp); srand(seed); return 1; } uint64_t get_randnum(uint64_t min, uint64_t max) { return (min + (uint64_t)((double)(max - min) * (rand() / (RAND_MAX + 1.0)))); } static uint64_t get_randnum_align(uint64_t min, uint64_t max, uint64_t align) { return (min + (uint64_t)((double)(max - min) * (rand() / (RAND_MAX + 1.0)))) & ~(align - 1); } int write_junk(const char *fname, int flags, int sync_options, uint64_t file_size) { int fd, len; uint64_t offset, generation = 0; char *buf; len = posix_memalign((void **)&buf, bufsize, bufsize); if (len) { errno = len; perror("alloc"); return 66; } fd = open(fname, flags | O_WRONLY); if (fd < 1) { perror(fname); return 64; } while (1) { len = snprintf(buf, bufsize - 1, "%d - %"PRIu64, getpid(), generation++); if (flags & O_DIRECT) { len = bufsize; offset = get_randnum_align(0, file_size - len, bufsize); } else { offset = get_randnum(0, file_size - len); } if (pwrite(fd, buf, len, offset) < 0) { perror("pwrite"); close(fd); free(buf); return 65; } if ((sync_options & SYNC_RANGE) && sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) < 0) { perror("sync_file_range"); close(fd); free(buf); return 67; } if ((sync_options & SYNC_FILE) && fsync(fd)) { perror("fsync"); close(fd); free(buf); return 68; } } return 0; } int mmap_junk(const char *fname, int flags, int sync_options, uint64_t file_size) { int fd, len; uint64_t offset, generation = 0; char *buf, *map; long page_size; page_size = sysconf(_SC_PAGESIZE); if (page_size < 0) { perror("_SC_PAGESIZE"); return 101; } fd = open(fname, flags | O_RDWR); if (fd < 1) { perror(fname); return 96; } len = posix_memalign((void **)&buf, bufsize, bufsize); if (len) { errno = len; perror("alloc"); return 102; } map = mmap(NULL, file_size, PROT_WRITE, MAP_SHARED, fd, 0); if (map == MAP_FAILED) { perror(fname); return 97; } while (1) { len = snprintf(buf, bufsize - 1, "%d - %"PRIu64, getpid(), generation++); if (flags & O_DIRECT) { len = bufsize; offset = get_randnum_align(0, file_size - len, bufsize); } else { offset = get_randnum(0, file_size - len); } memcpy(map + offset, buf, len); len += offset & (page_size - 1); offset &= ~(page_size - 1); if ((sync_options & SYNC_RANGE) && msync(map + offset, len, MS_SYNC | MS_INVALIDATE)) { perror("msync"); munmap(map, file_size); close(fd); free(buf); return 99; } if ((sync_options & SYNC_FILE) && fsync(fd)) { perror("fsync"); munmap(map, file_size); close(fd); free(buf); return 100; } } return 0; } int main(int argc, char *argv[]) { int opt, fd; unsigned long i, mthreads = 0, nthreads = 1; char *fname = NULL; int flags = 0, sync_options = 0; uint64_t file_size = 0; pid_t pid; int status; while ((opt = getopt(argc, argv, "b:dn:l:srfm:")) != -1) { switch (opt) { case 'd': flags |= O_DIRECT; break; case 'n': nthreads = strtoul(optarg, NULL, 0); break; case 'm': mthreads = strtoul(optarg, NULL, 0); break; case 'l': file_size = strtoull(optarg, NULL, 0); break; case 's': flags |= O_SYNC; break; case 'b': bufsize = strtoul(optarg, NULL, 0); break; case 'r': sync_options |= SYNC_RANGE; break; case 'f': sync_options |= SYNC_FILE; break; default: help(argv[0]); return 4; } } if (optind != argc - 1 || file_size < 1) { help(argv[0]); return 1; } fname = argv[optind]; if (!seed_random()) return 2; // truncate first fd = open(fname, flags | O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR); if (fd < 0) { perror(fname); return 3; } close(fd); // spawn threads and go to town if (nthreads == 1) return write_junk(fname, flags, sync_options, file_size); for (i = 0; i < nthreads; i++) { pid = fork(); if (!pid) return write_junk(fname, flags, sync_options, file_size); } for (i = 0; i < mthreads; i++) { pid = fork(); if (!pid) return mmap_junk(fname, flags, sync_options, file_size); } for (i = 0; i < mthreads + nthreads; i++) { pid = wait(&status); if (WIFEXITED(status)) printf("Child %d exited with status %d\n", pid, WEXITSTATUS(status)); } return 0; } ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 19:45 ` Darrick J. Wong @ 2011-02-22 22:53 ` Dave Chinner 2011-02-23 16:24 ` Martin K. Petersen 0 siblings, 1 reply; 67+ messages in thread From: Dave Chinner @ 2011-02-22 22:53 UTC (permalink / raw) To: Darrick J. Wong Cc: Andreas Dilger, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Tue, Feb 22, 2011 at 11:45:38AM -0800, Darrick J. Wong wrote: > On Tue, Feb 22, 2011 at 09:13:49AM -0700, Andreas Dilger wrote: > > On 2011-02-21, at 19:00, "Darrick J. Wong" <djwong@us.ibm.com> wrote: > > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2 > > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a > > > discussion about how to deal with the situation where one program tells the > > > kernel to write a block to disk, the kernel computes the checksum of that data, > > > and then a second program begins writing to that same block before the disk HBA > > > can DMA the memory block, thereby causing the disk to complain about being sent > > > invalid checksums. > > > > > > I was able to write a > > > trivial program to trigger the write problem, I'm pretty sure that this has not > > > been fixed upstream. (FYI, using O_DIRECT still seems fine.) > > > > Can you please attach your reproducer? IIRC it needed mmap() to hit this > > problem? Did you measure CPU usage during your testing? > > I didn't need mmap; a lot of threads using write() was enough. (The reproducer > program does have a mmap mode though). Basically it creates a lot of threads > to write small blobs to random offsets in a file, with optional mmap, dio, and > sync options. *nod* Both mmap and write paths need to be block on wait_for_page_writeback(page) once they have a locked page ready for modification. btrfs does this in btrfs_page_mkwrite() and prepare_pages(), so adding similar calls into block_page_mkwrite() and grab_cache_page_write_begin() would probably fix the problem for the other major filesystems.... > Agreed. I too am curious to study which circumstances favor copying vs > blocking. IMO blocking is generally preferable in high throughput threaded workloads as there is always another thread that can do useful work while we wait for IO to complete. Most use cases for DIF center around high throughput environments.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 22:53 ` Dave Chinner @ 2011-02-23 16:24 ` Martin K. Petersen 2011-02-23 23:47 ` Dave Chinner 2011-02-24 16:43 ` Jan Kara 0 siblings, 2 replies; 67+ messages in thread From: Martin K. Petersen @ 2011-02-23 16:24 UTC (permalink / raw) To: Dave Chinner Cc: Darrick J. Wong, Andreas Dilger, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi >>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes: >> Agreed. I too am curious to study which circumstances favor copying >> vs blocking. Dave> IMO blocking is generally preferable in high throughput threaded Dave> workloads as there is always another thread that can do useful Dave> work while we wait for IO to complete. Most use cases for DIF Dave> center around high throughput environments.... Yeah. A while back I did a bunch of tests with a liberal amount of wait_on_page_writeback() calls added to (I think) ext2, ext3, and XFS. For my regular workloads there was no measurable change (kernel builds, random database and I/O tests). I'm sure we'll unearth some apps that will suffer when DI is on but so far I'm not too worried about blocking in the data path. My main concern is wrt. metadata because that's where extN really hurts. Simple test: Unpack a kernel tarball and watch the directory block fireworks. Given how frequently those buffers get hit I'm sure blocking would cause performance to tank completely. I looked into fixing this in ext2 but I had to stop because my eyes were bleeding. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-23 16:24 ` Martin K. Petersen @ 2011-02-23 23:47 ` Dave Chinner 2011-02-24 16:43 ` Jan Kara 1 sibling, 0 replies; 67+ messages in thread From: Dave Chinner @ 2011-02-23 23:47 UTC (permalink / raw) To: Martin K. Petersen Cc: Darrick J. Wong, Andreas Dilger, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Wed, Feb 23, 2011 at 11:24:50AM -0500, Martin K. Petersen wrote: > >>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes: > > >> Agreed. I too am curious to study which circumstances favor copying > >> vs blocking. > > Dave> IMO blocking is generally preferable in high throughput threaded > Dave> workloads as there is always another thread that can do useful > Dave> work while we wait for IO to complete. Most use cases for DIF > Dave> center around high throughput environments.... > > Yeah. > > A while back I did a bunch of tests with a liberal amount of > wait_on_page_writeback() calls added to (I think) ext2, ext3, and > XFS. For my regular workloads there was no measurable change (kernel > builds, random database and I/O tests). I'm sure we'll unearth some apps > that will suffer when DI is on but so far I'm not too worried about > blocking in the data path. > > My main concern is wrt. metadata because that's where extN really > hurts. Simple test: Unpack a kernel tarball and watch the directory > block fireworks. Given how frequently those buffers get hit I'm sure > blocking would cause performance to tank completely. I looked into > fixing this in ext2 but I had to stop because my eyes were bleeding. Not problems with metadata modifications for XFS - all metadata IO is done with a lock held preventing modifications while IO is in progress. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-23 16:24 ` Martin K. Petersen 2011-02-23 23:47 ` Dave Chinner @ 2011-02-24 16:43 ` Jan Kara 1 sibling, 0 replies; 67+ messages in thread From: Jan Kara @ 2011-02-24 16:43 UTC (permalink / raw) To: Martin K. Petersen Cc: Dave Chinner, Darrick J. Wong, Andreas Dilger, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Wed 23-02-11 11:24:50, Martin K. Petersen wrote: > >>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes: > > >> Agreed. I too am curious to study which circumstances favor copying > >> vs blocking. > > Dave> IMO blocking is generally preferable in high throughput threaded > Dave> workloads as there is always another thread that can do useful > Dave> work while we wait for IO to complete. Most use cases for DIF > Dave> center around high throughput environments.... > > Yeah. > > A while back I did a bunch of tests with a liberal amount of > wait_on_page_writeback() calls added to (I think) ext2, ext3, and > XFS. For my regular workloads there was no measurable change (kernel > builds, random database and I/O tests). I'm sure we'll unearth some apps > that will suffer when DI is on but so far I'm not too worried about > blocking in the data path. > > My main concern is wrt. metadata because that's where extN really > hurts. Simple test: Unpack a kernel tarball and watch the directory > block fireworks. Given how frequently those buffers get hit I'm sure > blocking would cause performance to tank completely. I looked into > fixing this in ext2 but I had to stop because my eyes were bleeding. Ext2 is problematic yes, but ext[34] should be OK because we do metadata copy anyway because of journalling. So for ext[34] you shouldn't need any additional metadata protection since JBD does it for you (apart from nojournal mode of ext4 of course). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 16:13 ` Andreas Dilger 2011-02-22 16:40 ` Martin K. Petersen 2011-02-22 19:45 ` Darrick J. Wong @ 2011-02-28 8:49 ` Christoph Hellwig 2 siblings, 0 replies; 67+ messages in thread From: Christoph Hellwig @ 2011-02-28 8:49 UTC (permalink / raw) To: Andreas Dilger Cc: djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Tue, Feb 22, 2011 at 09:13:49AM -0700, Andreas Dilger wrote: > Can you please attach your reproducer? IIRC it needed mmap() to hit this problem? Did you measure CPU usage during your testing? Running XFSQA on a DIF target reproduces it. Even scsi_debug in the right mode is enough. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 2:00 [RFC] block integrity: Fix write after checksum calculation problem Darrick J. Wong 2011-02-22 5:45 ` Boaz Harrosh 2011-02-22 16:13 ` Andreas Dilger @ 2011-02-22 16:45 ` Martin K. Petersen 2011-02-23 20:24 ` Joel Becker 2 siblings, 1 reply; 67+ messages in thread From: Martin K. Petersen @ 2011-02-22 16:45 UTC (permalink / raw) To: djwong; +Cc: Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi >>>>> "Darrick" == Darrick J Wong <djwong@us.ibm.com> writes: Darrick> Unfortunately, I didn't get a sense that any sort of consensus Darrick> had been reached We had a session on this topic at the storage workshop in Auguest. The consensus was that filesystems really need to handle this. Also, DIX is only the tip of the iceberg. Many other impending technologies feature checksums and require pages to be stable during I/O due to checksumming, encryption and so on. The VM is already trying to do the right thing. We just need the relevant filesystems to catch up. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-22 16:45 ` Martin K. Petersen @ 2011-02-23 20:24 ` Joel Becker 2011-02-23 20:35 ` Chris Mason 0 siblings, 1 reply; 67+ messages in thread From: Joel Becker @ 2011-02-23 20:24 UTC (permalink / raw) To: Martin K. Petersen Cc: djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > Also, DIX is only the tip of the iceberg. Many other impending > technologies feature checksums and require pages to be stable during I/O > due to checksumming, encryption and so on. > > The VM is already trying to do the right thing. We just need the > relevant filesystems to catch up. ocfs2 handles stable metadata for its checksums when feeding things to the journal. If we're doing pagecache-based I/O, is the pagecache going to help here for data? Joel -- "Gone to plant a weeping willow On the bank's green edge it will roll, roll, roll. Sing a lulaby beside the waters. Lovers come and go, the river roll, roll, rolls." http://www.jlbec.org/ jlbec@evilplan.org ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-23 20:24 ` Joel Becker @ 2011-02-23 20:35 ` Chris Mason 2011-02-23 21:42 ` Joel Becker 2011-02-24 16:47 ` Jan Kara 0 siblings, 2 replies; 67+ messages in thread From: Chris Mason @ 2011-02-23 20:35 UTC (permalink / raw) To: Joel Becker Cc: Martin K. Petersen, djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > Also, DIX is only the tip of the iceberg. Many other impending > > technologies feature checksums and require pages to be stable during I/O > > due to checksumming, encryption and so on. > > > > The VM is already trying to do the right thing. We just need the > > relevant filesystems to catch up. > > ocfs2 handles stable metadata for its checksums when feeding > things to the journal. If we're doing pagecache-based I/O, is the > pagecache going to help here for data? Data is much easier than metadata. All you really need is to wait on writeback in file_write, wait on writeback in page_mkwrite, and make sure you don't free blocks back to the allocator that are actively under IO. I expect the hard part to be jbd and metadata in ext34. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-23 20:35 ` Chris Mason @ 2011-02-23 21:42 ` Joel Becker 2011-02-24 16:47 ` Jan Kara 1 sibling, 0 replies; 67+ messages in thread From: Joel Becker @ 2011-02-23 21:42 UTC (permalink / raw) To: Chris Mason Cc: Martin K. Petersen, djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Wed, Feb 23, 2011 at 03:35:11PM -0500, Chris Mason wrote: > > ocfs2 handles stable metadata for its checksums when feeding > > things to the journal. If we're doing pagecache-based I/O, is the > > pagecache going to help here for data? > > Data is much easier than metadata. All you really need is to wait on > writeback in file_write, wait on writeback in page_mkwrite, and make > sure you don't free blocks back to the allocator that are actively under > IO. > > I expect the hard part to be jbd and metadata in ext34. Yeah, catching use-without-access is not trivial. I keep thinking we've found them all, and then another bug crops up ;-) At least our checksums catch it. Joel -- "The doctrine of human equality reposes on this: that there is no man really clever who has not found that he is stupid." - Gilbert K. Chesterson http://www.jlbec.org/ jlbec@evilplan.org ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-23 20:35 ` Chris Mason 2011-02-23 21:42 ` Joel Becker @ 2011-02-24 16:47 ` Jan Kara 2011-02-24 17:37 ` Chris Mason 1 sibling, 1 reply; 67+ messages in thread From: Jan Kara @ 2011-02-24 16:47 UTC (permalink / raw) To: Chris Mason Cc: Joel Becker, Martin K. Petersen, djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Wed 23-02-11 15:35:11, Chris Mason wrote: > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > Also, DIX is only the tip of the iceberg. Many other impending > > > technologies feature checksums and require pages to be stable during I/O > > > due to checksumming, encryption and so on. > > > > > > The VM is already trying to do the right thing. We just need the > > > relevant filesystems to catch up. > > > > ocfs2 handles stable metadata for its checksums when feeding > > things to the journal. If we're doing pagecache-based I/O, is the > > pagecache going to help here for data? > > Data is much easier than metadata. All you really need is to wait on > writeback in file_write, wait on writeback in page_mkwrite, and make > sure you don't free blocks back to the allocator that are actively under > IO. > > I expect the hard part to be jbd and metadata in ext34. But JBD already has to do data copy if a buffer is going to be modified before/while it is written to the journal. So we should alredy do all that is needed for metadata. I don't say there aren't any bugs as they could be triggered only by crashing at the wrong moment and observing fs corruption. But most of the work should be there... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-24 16:47 ` Jan Kara @ 2011-02-24 17:37 ` Chris Mason 2011-02-24 18:27 ` Darrick J. Wong 0 siblings, 1 reply; 67+ messages in thread From: Chris Mason @ 2011-02-24 17:37 UTC (permalink / raw) To: Jan Kara Cc: Joel Becker, Martin K. Petersen, djwong, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > technologies feature checksums and require pages to be stable during I/O > > > > due to checksumming, encryption and so on. > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > relevant filesystems to catch up. > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > things to the journal. If we're doing pagecache-based I/O, is the > > > pagecache going to help here for data? > > > > Data is much easier than metadata. All you really need is to wait on > > writeback in file_write, wait on writeback in page_mkwrite, and make > > sure you don't free blocks back to the allocator that are actively under > > IO. > > > > I expect the hard part to be jbd and metadata in ext34. > But JBD already has to do data copy if a buffer is going to be modified > before/while it is written to the journal. So we should alredy do all that > is needed for metadata. I don't say there aren't any bugs as they could be > triggered only by crashing at the wrong moment and observing fs corruption. > But most of the work should be there... Most of it is there, but there are always little bits and pieces. The ext4 journal csumming code was one semi-recent example where we found metadata changing in flight. A big part of testing this is getting some way to detect the bugs without dif/dix. With btrfs I have patches to do set_memory_ro on pages once I've don the crc, hopefully we can generalize that idea or some up with something smarter. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-24 17:37 ` Chris Mason @ 2011-02-24 18:27 ` Darrick J. Wong 2011-02-28 12:54 ` Chris Mason 0 siblings, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-02-24 18:27 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > due to checksumming, encryption and so on. > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > relevant filesystems to catch up. > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > pagecache going to help here for data? > > > > > > Data is much easier than metadata. All you really need is to wait on > > > writeback in file_write, wait on writeback in page_mkwrite, and make > > > sure you don't free blocks back to the allocator that are actively under > > > IO. > > > > > > I expect the hard part to be jbd and metadata in ext34. > > But JBD already has to do data copy if a buffer is going to be modified > > before/while it is written to the journal. So we should alredy do all that > > is needed for metadata. I don't say there aren't any bugs as they could be > > triggered only by crashing at the wrong moment and observing fs corruption. > > But most of the work should be there... > > Most of it is there, but there are always little bits and pieces. The > ext4 journal csumming code was one semi-recent example where we found > metadata changing in flight. > > A big part of testing this is getting some way to detect the bugs > without dif/dix. With btrfs I have patches to do set_memory_ro on > pages once I've don the crc, hopefully we can generalize that idea or > some up with something smarter. Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199. Hm, would you mind sharing those patches? I've been working on a second patch to do the wait-on-writeback per everyone's suggestions, but I still see the occasional corruption error as soon as I enable the mmap write case and covet some more debugging tools. It does seem to be working for the pure pwrite() case. :) --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-24 18:27 ` Darrick J. Wong @ 2011-02-28 12:54 ` Chris Mason 2011-03-04 21:07 ` Darrick J. Wong 0 siblings, 1 reply; 67+ messages in thread From: Chris Mason @ 2011-02-28 12:54 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500: > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > > due to checksumming, encryption and so on. > > > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > > relevant filesystems to catch up. > > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > > pagecache going to help here for data? > > > > > > > > Data is much easier than metadata. All you really need is to wait on > > > > writeback in file_write, wait on writeback in page_mkwrite, and make > > > > sure you don't free blocks back to the allocator that are actively under > > > > IO. > > > > > > > > I expect the hard part to be jbd and metadata in ext34. > > > But JBD already has to do data copy if a buffer is going to be modified > > > before/while it is written to the journal. So we should alredy do all that > > > is needed for metadata. I don't say there aren't any bugs as they could be > > > triggered only by crashing at the wrong moment and observing fs corruption. > > > But most of the work should be there... > > > > Most of it is there, but there are always little bits and pieces. The > > ext4 journal csumming code was one semi-recent example where we found > > metadata changing in flight. > > > > A big part of testing this is getting some way to detect the bugs > > without dif/dix. With btrfs I have patches to do set_memory_ro on > > pages once I've don the crc, hopefully we can generalize that idea or > > some up with something smarter. > > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199. > > Hm, would you mind sharing those patches? I've been working on a second patch > to do the wait-on-writeback per everyone's suggestions, but I still see the > occasional corruption error as soon as I enable the mmap write case and covet > some more debugging tools. It does seem to be working for the pure pwrite() > case. :) Here's an ext4 version of the debugging patch. It's a few years old but it'll give you the idea. This only covers metadata pages. Looks like I hacked the btrfs version up and didn't keep the original, I'll have to rework it, I was trying to use it for the big corruption I fixed recently and made a bunch of changes. For data if mmap is giving you trouble you need to wait on writeback in page_mkwrite, with the page locked. fs/btrfs/inode.c has our page_mkwrite, which uses wait_on_page_writeback() and also the btrfs ordered write code. But for the other filesystems, waiting on writeback should be enough. -chris diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index d4cfd6d..9f278db 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -28,6 +28,16 @@ #include <linux/blkdev.h> #include <trace/events/jbd2.h> +int set_memory_ro(unsigned long addr, int numpages); + +static int set_page_ro(struct page *page) +{ + unsigned long addr = (unsigned long)page_address(page); + if (PageHighMem(page)) + return 0; + return set_memory_ro(addr, 1); +} + /* * Default IO end handler for temporary BJ_IO buffer_heads. */ @@ -639,6 +654,8 @@ void jbd2_journal_commit_transaction(journal_t *journal) set_bit(BH_JWrite, &jh2bh(new_jh)->b_state); wbuf[bufs++] = jh2bh(new_jh); + set_page_ro(jh2bh(new_jh)->b_page); + /* Record the new block's tag in the current descriptor buffer */ diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index a051270..153888e 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -28,6 +28,15 @@ #include <linux/hrtimer.h> static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh); +int set_memory_rw(unsigned long addr, int numpages); +static int set_page_rw(struct page *page) +{ + unsigned long addr = (unsigned long)page_address(page); + if (PageHighMem(page)) + return 0; + return set_memory_rw(addr, 1); +} + /* * jbd2_get_transaction: obtain a new transaction_t object. @@ -1474,9 +1487,11 @@ void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh) break; case BJ_IO: list = &transaction->t_iobuf_list; + set_page_rw(bh->b_page); break; case BJ_Shadow: list = &transaction->t_shadow_list; + set_page_rw(bh->b_page); break; case BJ_LogCtl: list = &transaction->t_log_list; ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-02-28 12:54 ` Chris Mason @ 2011-03-04 21:07 ` Darrick J. Wong 2011-03-04 22:22 ` Andreas Dilger ` (2 more replies) 0 siblings, 3 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-03-04 21:07 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote: > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500: > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > > > due to checksumming, encryption and so on. > > > > > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > > > relevant filesystems to catch up. > > > > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > > > pagecache going to help here for data? > > > > > > > > > > Data is much easier than metadata. All you really need is to wait on > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make > > > > > sure you don't free blocks back to the allocator that are actively under > > > > > IO. > > > > > > > > > > I expect the hard part to be jbd and metadata in ext34. > > > > But JBD already has to do data copy if a buffer is going to be modified > > > > before/while it is written to the journal. So we should alredy do all that > > > > is needed for metadata. I don't say there aren't any bugs as they could be > > > > triggered only by crashing at the wrong moment and observing fs corruption. > > > > But most of the work should be there... > > > > > > Most of it is there, but there are always little bits and pieces. The > > > ext4 journal csumming code was one semi-recent example where we found > > > metadata changing in flight. > > > > > > A big part of testing this is getting some way to detect the bugs > > > without dif/dix. With btrfs I have patches to do set_memory_ro on > > > pages once I've don the crc, hopefully we can generalize that idea or > > > some up with something smarter. > > > > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199. > > > > Hm, would you mind sharing those patches? I've been working on a second patch > > to do the wait-on-writeback per everyone's suggestions, but I still see the > > occasional corruption error as soon as I enable the mmap write case and covet > > some more debugging tools. It does seem to be working for the pure pwrite() > > case. :) > > Here's an ext4 version of the debugging patch. It's a few years old but > it'll give you the idea. This only covers metadata pages. > > Looks like I hacked the btrfs version up and didn't keep the original, > I'll have to rework it, I was trying to use it for the big corruption I > fixed recently and made a bunch of changes. > > For data if mmap is giving you trouble you need to wait on writeback in > page_mkwrite, with the page locked. fs/btrfs/inode.c has our > page_mkwrite, which uses wait_on_page_writeback() and also the btrfs > ordered write code. But for the other filesystems, waiting on writeback > should be enough. Ok, here's what I have so far. I took everyone's suggestions of where to add calls to wait_on_page_writeback, which seems to handle the multiple-write case adequately. Unfortunately, it is still possible to generate checksum errors by scribbling furiously on a mmap'd region, even after adding the writeback wait in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by removing its wait_for_page_writeback call, so I suspect there's a bit more going on in btrfs than I've been able to figure out. The set_memory_ro debugging trick didn't ferret out any write paths that I didn't catch... though it did have the effect of causing occasional fsync() deadlocks. I suppose I could sprinkle in a few more of those write calls to see what happens. Either way, I'm emailing to ask everyone's advice since I've run out of ideas. Or: Did I miss something? Thanks all for the feedback so far! -- fs: Wait for page writeback when rewrite detected Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- fs/buffer.c | 4 +++- fs/ext4/inode.c | 3 +++ mm/filemap.c | 15 +++++++++++++-- 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 2219a76..39e934c 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, ret = VM_FAULT_OOM; else /* -ENOSPC, -EIO, etc */ ret = VM_FAULT_SIGBUS; - } else + } else { + wait_on_page_writeback(page); ret = VM_FAULT_LOCKED; + } out: return ret; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9f7f9e4..2364704 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2730,12 +2730,15 @@ static int ext4_writepage(struct page *page, struct inode *inode = page->mapping->host; trace_ext4_writepage(inode, page); +lock_page(page); size = i_size_read(inode); if (page->index == size >> PAGE_CACHE_SHIFT) len = size & ~PAGE_CACHE_MASK; else len = PAGE_CACHE_SIZE; +wait_on_page_writeback(page); + /* * If the page does not have buffers (for whatever reason), * try to create them using __block_write_begin. If this diff --git a/mm/filemap.c b/mm/filemap.c index 83a45d3..f201d80 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write); * Find or create a page at the given pagecache position. Return the locked * page. This function is specifically for buffered writes. */ -struct page *grab_cache_page_write_begin(struct address_space *mapping, - pgoff_t index, unsigned flags) +struct page *__grab_cache_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags) { int status; struct page *page; @@ -2243,6 +2243,17 @@ repeat: } return page; } +struct page *grab_cache_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags) +{ + struct page *p; + + p = __grab_cache_page_write_begin(mapping, index, flags); + if (p) + wait_on_page_writeback(p); + + return p; +} EXPORT_SYMBOL(grab_cache_page_write_begin); static ssize_t generic_perform_write(struct file *file, ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-04 21:07 ` Darrick J. Wong @ 2011-03-04 22:22 ` Andreas Dilger 2011-03-07 19:11 ` Darrick J. Wong 2011-03-07 21:12 ` Chris Mason 2011-03-08 4:56 ` Dave Chinner 2 siblings, 1 reply; 67+ messages in thread From: Andreas Dilger @ 2011-03-04 22:22 UTC (permalink / raw) To: djwong Cc: Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On 2011-03-04, at 2:07 PM, Darrick J. Wong wrote: > Ok, here's what I have so far. I took everyone's suggestions of where to add > calls to wait_on_page_writeback, which seems to handle the multiple-write case > adequately. Unfortunately, it is still possible to generate checksum errors by > scribbling furiously on a mmap'd region, even after adding the writeback wait > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > removing its wait_for_page_writeback call, so I suspect there's a bit more > going on in btrfs than I've been able to figure out. Did you add this to ext4_page_mkwrite() or only ext4_writepage()? It wasn't clear from your description above. > The set_memory_ro debugging trick didn't ferret out any write paths that I > didn't catch... though it did have the effect of causing occasional fsync() > deadlocks. I suppose I could sprinkle in a few more of those write calls to > see what happens. > > Either way, I'm emailing to ask everyone's advice since I've run out of ideas. > Or: Did I miss something? > > Thanks all for the feedback so far! > > -- > fs: Wait for page writeback when rewrite detected > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > --- > > fs/buffer.c | 4 +++- > fs/ext4/inode.c | 3 +++ > mm/filemap.c | 15 +++++++++++++-- > 3 files changed, 19 insertions(+), 3 deletions(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index 2219a76..39e934c 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > ret = VM_FAULT_OOM; > else /* -ENOSPC, -EIO, etc */ > ret = VM_FAULT_SIGBUS; > - } else > + } else { > + wait_on_page_writeback(page); > ret = VM_FAULT_LOCKED; > + } > > out: > return ret; > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 9f7f9e4..2364704 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2730,12 +2730,15 @@ static int ext4_writepage(struct page *page, > struct inode *inode = page->mapping->host; > > trace_ext4_writepage(inode, page); > +lock_page(page); > size = i_size_read(inode); > if (page->index == size >> PAGE_CACHE_SHIFT) > len = size & ~PAGE_CACHE_MASK; > else > len = PAGE_CACHE_SIZE; > > +wait_on_page_writeback(page); > + > /* > * If the page does not have buffers (for whatever reason), > * try to create them using __block_write_begin. If this > diff --git a/mm/filemap.c b/mm/filemap.c > index 83a45d3..f201d80 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write); > * Find or create a page at the given pagecache position. Return the locked > * page. This function is specifically for buffered writes. > */ > -struct page *grab_cache_page_write_begin(struct address_space *mapping, > - pgoff_t index, unsigned flags) > +struct page *__grab_cache_page_write_begin(struct address_space *mapping, > + pgoff_t index, unsigned flags) > { > int status; > struct page *page; > @@ -2243,6 +2243,17 @@ repeat: > } > return page; > } > +struct page *grab_cache_page_write_begin(struct address_space *mapping, > + pgoff_t index, unsigned flags) > +{ > + struct page *p; > + > + p = __grab_cache_page_write_begin(mapping, index, flags); > + if (p) > + wait_on_page_writeback(p); > + > + return p; > +} > EXPORT_SYMBOL(grab_cache_page_write_begin); > > static ssize_t generic_perform_write(struct file *file, > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-04 22:22 ` Andreas Dilger @ 2011-03-07 19:11 ` Darrick J. Wong 0 siblings, 0 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-03-07 19:11 UTC (permalink / raw) To: Andreas Dilger Cc: Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Fri, Mar 04, 2011 at 03:22:30PM -0700, Andreas Dilger wrote: > On 2011-03-04, at 2:07 PM, Darrick J. Wong wrote: > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > adequately. Unfortunately, it is still possible to generate checksum errors by > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > going on in btrfs than I've been able to figure out. > > Did you add this to ext4_page_mkwrite() or only ext4_writepage()? It wasn't > clear from your description above. I added a wait_on_page_writeback to ext4_page_mkwrite which seems to have cut down on the error message frequency, but it hasn't gone away entirely. --D > > > The set_memory_ro debugging trick didn't ferret out any write paths that I > > didn't catch... though it did have the effect of causing occasional fsync() > > deadlocks. I suppose I could sprinkle in a few more of those write calls to > > see what happens. > > > > Either way, I'm emailing to ask everyone's advice since I've run out of ideas. > > Or: Did I miss something? > > > > Thanks all for the feedback so far! > > > > -- > > fs: Wait for page writeback when rewrite detected > > > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > > --- > > > > fs/buffer.c | 4 +++- > > fs/ext4/inode.c | 3 +++ > > mm/filemap.c | 15 +++++++++++++-- > > 3 files changed, 19 insertions(+), 3 deletions(-) > > > > diff --git a/fs/buffer.c b/fs/buffer.c > > index 2219a76..39e934c 100644 > > --- a/fs/buffer.c > > +++ b/fs/buffer.c > > @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > > ret = VM_FAULT_OOM; > > else /* -ENOSPC, -EIO, etc */ > > ret = VM_FAULT_SIGBUS; > > - } else > > + } else { > > + wait_on_page_writeback(page); > > ret = VM_FAULT_LOCKED; > > + } > > > > out: > > return ret; > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index 9f7f9e4..2364704 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -2730,12 +2730,15 @@ static int ext4_writepage(struct page *page, > > struct inode *inode = page->mapping->host; > > > > trace_ext4_writepage(inode, page); > > +lock_page(page); > > size = i_size_read(inode); > > if (page->index == size >> PAGE_CACHE_SHIFT) > > len = size & ~PAGE_CACHE_MASK; > > else > > len = PAGE_CACHE_SIZE; > > > > +wait_on_page_writeback(page); > > + > > /* > > * If the page does not have buffers (for whatever reason), > > * try to create them using __block_write_begin. If this > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 83a45d3..f201d80 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write); > > * Find or create a page at the given pagecache position. Return the locked > > * page. This function is specifically for buffered writes. > > */ > > -struct page *grab_cache_page_write_begin(struct address_space *mapping, > > - pgoff_t index, unsigned flags) > > +struct page *__grab_cache_page_write_begin(struct address_space *mapping, > > + pgoff_t index, unsigned flags) > > { > > int status; > > struct page *page; > > @@ -2243,6 +2243,17 @@ repeat: > > } > > return page; > > } > > +struct page *grab_cache_page_write_begin(struct address_space *mapping, > > + pgoff_t index, unsigned flags) > > +{ > > + struct page *p; > > + > > + p = __grab_cache_page_write_begin(mapping, index, flags); > > + if (p) > > + wait_on_page_writeback(p); > > + > > + return p; > > +} > > EXPORT_SYMBOL(grab_cache_page_write_begin); > > > > static ssize_t generic_perform_write(struct file *file, > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > Cheers, Andreas > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-04 21:07 ` Darrick J. Wong 2011-03-04 22:22 ` Andreas Dilger @ 2011-03-07 21:12 ` Chris Mason 2011-03-08 4:56 ` Dave Chinner 2 siblings, 0 replies; 67+ messages in thread From: Chris Mason @ 2011-03-07 21:12 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Excerpts from Darrick J. Wong's message of 2011-03-04 16:07:24 -0500: > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote: > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500: > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > > > > due to checksumming, encryption and so on. > > > > > > > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > > > > relevant filesystems to catch up. > > > > > > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > > > > pagecache going to help here for data? > > > > > > > > > > > > Data is much easier than metadata. All you really need is to wait on > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make > > > > > > sure you don't free blocks back to the allocator that are actively under > > > > > > IO. > > > > > > > > > > > > I expect the hard part to be jbd and metadata in ext34. > > > > > But JBD already has to do data copy if a buffer is going to be modified > > > > > before/while it is written to the journal. So we should alredy do all that > > > > > is needed for metadata. I don't say there aren't any bugs as they could be > > > > > triggered only by crashing at the wrong moment and observing fs corruption. > > > > > But most of the work should be there... > > > > > > > > Most of it is there, but there are always little bits and pieces. The > > > > ext4 journal csumming code was one semi-recent example where we found > > > > metadata changing in flight. > > > > > > > > A big part of testing this is getting some way to detect the bugs > > > > without dif/dix. With btrfs I have patches to do set_memory_ro on > > > > pages once I've don the crc, hopefully we can generalize that idea or > > > > some up with something smarter. > > > > > > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199. > > > > > > Hm, would you mind sharing those patches? I've been working on a second patch > > > to do the wait-on-writeback per everyone's suggestions, but I still see the > > > occasional corruption error as soon as I enable the mmap write case and covet > > > some more debugging tools. It does seem to be working for the pure pwrite() > > > case. :) > > > > Here's an ext4 version of the debugging patch. It's a few years old but > > it'll give you the idea. This only covers metadata pages. > > > > Looks like I hacked the btrfs version up and didn't keep the original, > > I'll have to rework it, I was trying to use it for the big corruption I > > fixed recently and made a bunch of changes. > > > > For data if mmap is giving you trouble you need to wait on writeback in > > page_mkwrite, with the page locked. fs/btrfs/inode.c has our > > page_mkwrite, which uses wait_on_page_writeback() and also the btrfs > > ordered write code. But for the other filesystems, waiting on writeback > > should be enough. > > Ok, here's what I have so far. I took everyone's suggestions of where to add > calls to wait_on_page_writeback, which seems to handle the multiple-write case > adequately. Unfortunately, it is still possible to generate checksum errors by > scribbling furiously on a mmap'd region, even after adding the writeback wait > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > removing its wait_for_page_writeback call, so I suspect there's a bit more > going on in btrfs than I've been able to figure out. Hi, thanks for working on this. Btrfs waits for page writeback but then it also waits for its own ordered extents, which is basically tracks writeback and some extra btrfs accounting. If you take out the btrfs_lookup_ordered_extent bits, and the waiting on page writeback, you'll see fireworks. Your patch changes ext4_writepage, but by the time writepage is called we should already have waited for PageWriteback. So that should be BUG_ON(PageWriteback(page)). And ext4_writepage should be called with the page locked...so your extra lock_page call should be deadlocking. Maybe all the writes are going through writepages instead? You really want to be changing ext4_page_mkwrite, that's where all the magic for mmap really happens. The block_page_mkwrite call is a good start, but ext4 doesn't use it. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-04 21:07 ` Darrick J. Wong 2011-03-04 22:22 ` Andreas Dilger 2011-03-07 21:12 ` Chris Mason @ 2011-03-08 4:56 ` Dave Chinner 2011-03-10 23:57 ` Darrick J. Wong 2011-03-19 0:07 ` Darrick J. Wong 2 siblings, 2 replies; 67+ messages in thread From: Dave Chinner @ 2011-03-08 4:56 UTC (permalink / raw) To: Darrick J. Wong Cc: Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote: > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote: > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500: > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > > > > due to checksumming, encryption and so on. > > > > > > > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > > > > relevant filesystems to catch up. > > > > > > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > > > > pagecache going to help here for data? > > > > > > > > > > > > Data is much easier than metadata. All you really need is to wait on > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make > > > > > > sure you don't free blocks back to the allocator that are actively under > > > > > > IO. > > > > > > > > > > > > I expect the hard part to be jbd and metadata in ext34. > > > > > But JBD already has to do data copy if a buffer is going to be modified > > > > > before/while it is written to the journal. So we should alredy do all that > > > > > is needed for metadata. I don't say there aren't any bugs as they could be > > > > > triggered only by crashing at the wrong moment and observing fs corruption. > > > > > But most of the work should be there... > > > > > > > > Most of it is there, but there are always little bits and pieces. The > > > > ext4 journal csumming code was one semi-recent example where we found > > > > metadata changing in flight. > > > > > > > > A big part of testing this is getting some way to detect the bugs > > > > without dif/dix. With btrfs I have patches to do set_memory_ro on > > > > pages once I've don the crc, hopefully we can generalize that idea or > > > > some up with something smarter. > > > > > > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199. > > > > > > Hm, would you mind sharing those patches? I've been working on a second patch > > > to do the wait-on-writeback per everyone's suggestions, but I still see the > > > occasional corruption error as soon as I enable the mmap write case and covet > > > some more debugging tools. It does seem to be working for the pure pwrite() > > > case. :) > > > > Here's an ext4 version of the debugging patch. It's a few years old but > > it'll give you the idea. This only covers metadata pages. > > > > Looks like I hacked the btrfs version up and didn't keep the original, > > I'll have to rework it, I was trying to use it for the big corruption I > > fixed recently and made a bunch of changes. > > > > For data if mmap is giving you trouble you need to wait on writeback in > > page_mkwrite, with the page locked. fs/btrfs/inode.c has our > > page_mkwrite, which uses wait_on_page_writeback() and also the btrfs > > ordered write code. But for the other filesystems, waiting on writeback > > should be enough. > > Ok, here's what I have so far. I took everyone's suggestions of where to add > calls to wait_on_page_writeback, which seems to handle the multiple-write case > adequately. Unfortunately, it is still possible to generate checksum errors by > scribbling furiously on a mmap'd region, even after adding the writeback wait > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > removing its wait_for_page_writeback call, so I suspect there's a bit more > going on in btrfs than I've been able to figure out. > > The set_memory_ro debugging trick didn't ferret out any write paths that I > didn't catch... though it did have the effect of causing occasional fsync() > deadlocks. I suppose I could sprinkle in a few more of those write calls to > see what happens. > > Either way, I'm emailing to ask everyone's advice since I've run out of ideas. > Or: Did I miss something? > > Thanks all for the feedback so far! > > -- > fs: Wait for page writeback when rewrite detected > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > --- > > fs/buffer.c | 4 +++- > fs/ext4/inode.c | 3 +++ > mm/filemap.c | 15 +++++++++++++-- > 3 files changed, 19 insertions(+), 3 deletions(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index 2219a76..39e934c 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > ret = VM_FAULT_OOM; > else /* -ENOSPC, -EIO, etc */ > ret = VM_FAULT_SIGBUS; > - } else > + } else { > + wait_on_page_writeback(page); > ret = VM_FAULT_LOCKED; > + } I think this needs to wait before the __block_write_begin() call, not after it. i.e. wait before the page is mapped, not afterwards. .... > diff --git a/mm/filemap.c b/mm/filemap.c > index 83a45d3..f201d80 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write); > * Find or create a page at the given pagecache position. Return the locked > * page. This function is specifically for buffered writes. > */ > -struct page *grab_cache_page_write_begin(struct address_space *mapping, > - pgoff_t index, unsigned flags) > +struct page *__grab_cache_page_write_begin(struct address_space *mapping, > + pgoff_t index, unsigned flags) > { > int status; > struct page *page; > @@ -2243,6 +2243,17 @@ repeat: > } > return page; > } > +struct page *grab_cache_page_write_begin(struct address_space *mapping, > + pgoff_t index, unsigned flags) > +{ > + struct page *p; > + > + p = __grab_cache_page_write_begin(mapping, index, flags); > + if (p) > + wait_on_page_writeback(p); > + > + return p; > +} > EXPORT_SYMBOL(grab_cache_page_write_begin); Not much point in add in a wrapper when nothing else calls __grab_cache_page_write_begin(), which should also be static.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-08 4:56 ` Dave Chinner @ 2011-03-10 23:57 ` Darrick J. Wong 2011-03-11 16:34 ` Chris Mason 2011-03-19 0:07 ` Darrick J. Wong 1 sibling, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-03-10 23:57 UTC (permalink / raw) To: Dave Chinner Cc: Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Tue, Mar 08, 2011 at 03:56:26PM +1100, Dave Chinner wrote: > On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote: > > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote: > > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500: > > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > > > > > due to checksumming, encryption and so on. > > > > > > > > > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > > > > > relevant filesystems to catch up. > > > > > > > > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > > > > > pagecache going to help here for data? > > > > > > > > > > > > > > Data is much easier than metadata. All you really need is to wait on > > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make Hrm... I've been looking for a file_write in ext4; was the aio_write function pointer what you had in mind here? Adding a wait_on_page_writeback to ext4_page_mkwrite eliminated most of the DIF checksum errors in the mmap case. Unfortunately, I still see them, usually within the first few seconds of kicking off the not-first run. I suspect that may be related to the test program truncating the file after each run causing blocks to come and go, so I'll investigate that part of ext4 next. I also noticed that testing the directio+mmap case exhibits the same symptoms. Anyhow, just attaching the latest hack of mine in case anyone wants to have a look. -- fs: Wait for page writeback when rewrite detected --- fs/buffer.c | 2 ++ fs/ext4/inode.c | 6 +++++- mm/filemap.c | 19 +++++++++++++++++-- mm/memory.c | 3 +++ 4 files changed, 27 insertions(+), 3 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 2219a76..f3639f2 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2369,6 +2369,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, else end = PAGE_CACHE_SIZE; + WARN_ON(!PageLocked(page)); + wait_on_page_writeback(page); ret = __block_write_begin(page, 0, end, get_block); if (!ret) ret = block_commit_write(page, 0, end); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9f7f9e4..ba45b31 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2762,6 +2762,8 @@ static int ext4_writepage(struct page *page, */ goto redirty_page; } + WARN_ON(!PageLocked(page)); + wait_on_page_writeback(page); if (commit_write) /* now mark the buffer_heads as dirty and uptodate */ block_commit_write(page, 0, len); @@ -5865,12 +5867,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (PageMappedToDisk(page)) goto out_unlock; + lock_page(page); + /* this one seems to handle mmap */ + wait_on_page_writeback(page); if (page->index == size >> PAGE_CACHE_SHIFT) len = size & ~PAGE_CACHE_MASK; else len = PAGE_CACHE_SIZE; - lock_page(page); /* * return if we have all the buffers mapped. This avoid * the need to call write_begin/write_end which does a diff --git a/mm/filemap.c b/mm/filemap.c index 83a45d3..61af700 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2217,8 +2217,9 @@ EXPORT_SYMBOL(generic_file_direct_write); * Find or create a page at the given pagecache position. Return the locked * page. This function is specifically for buffered writes. */ -struct page *grab_cache_page_write_begin(struct address_space *mapping, - pgoff_t index, unsigned flags) +static struct page * +__grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, + unsigned flags) { int status; struct page *page; @@ -2243,6 +2244,20 @@ repeat: } return page; } + +struct page *grab_cache_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags) +{ + struct page *p; + + p = __grab_cache_page_write_begin(mapping, index, flags); + if (p) { + WARN_ON(!PageLocked(p)); + wait_on_page_writeback(p); + } + + return p; +} EXPORT_SYMBOL(grab_cache_page_write_begin); static ssize_t generic_perform_write(struct file *file, ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-10 23:57 ` Darrick J. Wong @ 2011-03-11 16:34 ` Chris Mason 2011-03-11 18:51 ` Darrick J. Wong 0 siblings, 1 reply; 67+ messages in thread From: Chris Mason @ 2011-03-11 16:34 UTC (permalink / raw) To: Darrick J. Wong Cc: Dave Chinner, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Excerpts from Darrick J. Wong's message of 2011-03-10 18:57:22 -0500: > On Tue, Mar 08, 2011 at 03:56:26PM +1100, Dave Chinner wrote: > > > On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote: > > > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote: > > > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500: > > > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > > > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > > > > > > due to checksumming, encryption and so on. > > > > > > > > > > > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > > > > > > relevant filesystems to catch up. > > > > > > > > > > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > > > > > > pagecache going to help here for data? > > > > > > > > > > > > > > > > Data is much easier than metadata. All you really need is to wait on > > > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make > > Hrm... I've been looking for a file_write in ext4; was the aio_write function > pointer what you had in mind here? Your change to grab_cache_page_write_begin looks good to me, at least for ext4. For ext3 you have to actually go in and wait for each of the buffer heads in the page, since ext3 (and reiserfs) will write the buffer heads directly without using writepage. Have you confirmed by looking at the block mapping that your crc errors are from data blocks? -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-11 16:34 ` Chris Mason @ 2011-03-11 18:51 ` Darrick J. Wong 0 siblings, 0 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-03-11 18:51 UTC (permalink / raw) To: Chris Mason Cc: Dave Chinner, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Fri, Mar 11, 2011 at 11:34:05AM -0500, Chris Mason wrote: > Excerpts from Darrick J. Wong's message of 2011-03-10 18:57:22 -0500: > > On Tue, Mar 08, 2011 at 03:56:26PM +1100, Dave Chinner wrote: > > > > > On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote: > > > > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote: > > > > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500: > > > > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > > > > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > > > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > > > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > > > > > > > due to checksumming, encryption and so on. > > > > > > > > > > > > > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > > > > > > > relevant filesystems to catch up. > > > > > > > > > > > > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > > > > > > > pagecache going to help here for data? > > > > > > > > > > > > > > > > > > Data is much easier than metadata. All you really need is to wait on > > > > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make > > > > Hrm... I've been looking for a file_write in ext4; was the aio_write function > > pointer what you had in mind here? > > Your change to grab_cache_page_write_begin looks good to me, at least > for ext4. For ext3 you have to actually go in and wait for each of the > buffer heads in the page, since ext3 (and reiserfs) will write the buffer heads > directly without using writepage. Heh, I hadn't really given much thought to ext2/3 yet. I'm still trying to figure out what causes the occasional ext4 failures. :) > Have you confirmed by looking at the block mapping that your crc errors > are from data blocks? Yes, debugfs confirms that they are data blocks. --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-08 4:56 ` Dave Chinner 2011-03-10 23:57 ` Darrick J. Wong @ 2011-03-19 0:07 ` Darrick J. Wong 2011-03-19 2:28 ` Andreas Dilger 2011-03-21 14:04 ` Jan Kara 1 sibling, 2 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-03-19 0:07 UTC (permalink / raw) To: Dave Chinner Cc: Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Tue, Mar 08, 2011 at 03:56:26PM +1100, Dave Chinner wrote: > On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote: > > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote: > > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500: > > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote: > > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500: > > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote: > > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500: > > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote: > > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending > > > > > > > > > technologies feature checksums and require pages to be stable during I/O > > > > > > > > > due to checksumming, encryption and so on. > > > > > > > > > > > > > > > > > > The VM is already trying to do the right thing. We just need the > > > > > > > > > relevant filesystems to catch up. > > > > > > > > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding > > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the > > > > > > > > pagecache going to help here for data? > > > > > > > > > > > > > > Data is much easier than metadata. All you really need is to wait on > > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make > > > > > > > sure you don't free blocks back to the allocator that are actively under > > > > > > > IO. > > > > > > > > > > > > > > I expect the hard part to be jbd and metadata in ext34. > > > > > > But JBD already has to do data copy if a buffer is going to be modified > > > > > > before/while it is written to the journal. So we should alredy do all that > > > > > > is needed for metadata. I don't say there aren't any bugs as they could be > > > > > > triggered only by crashing at the wrong moment and observing fs corruption. > > > > > > But most of the work should be there... > > > > > > > > > > Most of it is there, but there are always little bits and pieces. The > > > > > ext4 journal csumming code was one semi-recent example where we found > > > > > metadata changing in flight. > > > > > > > > > > A big part of testing this is getting some way to detect the bugs > > > > > without dif/dix. With btrfs I have patches to do set_memory_ro on > > > > > pages once I've don the crc, hopefully we can generalize that idea or > > > > > some up with something smarter. > > > > > > > > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199. > > > > > > > > Hm, would you mind sharing those patches? I've been working on a second patch > > > > to do the wait-on-writeback per everyone's suggestions, but I still see the > > > > occasional corruption error as soon as I enable the mmap write case and covet > > > > some more debugging tools. It does seem to be working for the pure pwrite() > > > > case. :) > > > > > > Here's an ext4 version of the debugging patch. It's a few years old but > > > it'll give you the idea. This only covers metadata pages. > > > > > > Looks like I hacked the btrfs version up and didn't keep the original, > > > I'll have to rework it, I was trying to use it for the big corruption I > > > fixed recently and made a bunch of changes. > > > > > > For data if mmap is giving you trouble you need to wait on writeback in > > > page_mkwrite, with the page locked. fs/btrfs/inode.c has our > > > page_mkwrite, which uses wait_on_page_writeback() and also the btrfs > > > ordered write code. But for the other filesystems, waiting on writeback > > > should be enough. > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > adequately. Unfortunately, it is still possible to generate checksum errors by > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > going on in btrfs than I've been able to figure out. I wonder, is it possible for this to happen: 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, but there's no ongoing writeback, so it returns without delay. 2. Thread A starts writing furiously to the page. 3. Thread B runs fsync() or something that results in the page being checksummed and scheduled for writeout. 4. Thread A continues to write furiously(!) on that same page before the controller finishes the DMA transfer. 5. Disk gets the page, which now doesn't match its checksum, and *boom* After letting the stress tool run for a few days, I can say fairly confidently that the write() case doesn't seem to fail regardless of the O_DIRECT setting. However, with writes to mmap regions, failures happen about once every 20-40 minutes, even with O_DIRECT set. To me this suggests some sort of race condition that we seem to win except once every 20 minutes. I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps there's something that I could do just prior to calculating the DIF checksum that would cause any subsequent write attempts to be shuffled back into page_mkwrite. I tried the set_memory_ro thing again, though that led to some recursive lock errors and I noticed that those functions only seem to exist in arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling messy, only seemed to corrupt memory. :) Is there a "correct" way to take a writeable page and make it so that any process trying to write to it ends up hitting the page fault handler where we can then wait for writeback? Or perhaps I am simply barking up the wrong tree? (Just FYI I took the old copy-everything-to-bounce-buffers patch that few people liked for a second spin, and the errors did not surface regardless of what combination of write/mmap and directio/bufferedio I told it to use.) --D > > > > The set_memory_ro debugging trick didn't ferret out any write paths that I > > didn't catch... though it did have the effect of causing occasional fsync() > > deadlocks. I suppose I could sprinkle in a few more of those write calls to > > see what happens. > > > > Either way, I'm emailing to ask everyone's advice since I've run out of ideas. > > Or: Did I miss something? > > > > Thanks all for the feedback so far! > > > > -- > > fs: Wait for page writeback when rewrite detected > > > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > > --- > > > > fs/buffer.c | 4 +++- > > fs/ext4/inode.c | 3 +++ > > mm/filemap.c | 15 +++++++++++++-- > > 3 files changed, 19 insertions(+), 3 deletions(-) > > > > diff --git a/fs/buffer.c b/fs/buffer.c > > index 2219a76..39e934c 100644 > > --- a/fs/buffer.c > > +++ b/fs/buffer.c > > @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > > ret = VM_FAULT_OOM; > > else /* -ENOSPC, -EIO, etc */ > > ret = VM_FAULT_SIGBUS; > > - } else > > + } else { > > + wait_on_page_writeback(page); > > ret = VM_FAULT_LOCKED; > > + } > > I think this needs to wait before the __block_write_begin() call, > not after it. i.e. wait before the page is mapped, not afterwards. > > .... > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 83a45d3..f201d80 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write); > > * Find or create a page at the given pagecache position. Return the locked > > * page. This function is specifically for buffered writes. > > */ > > -struct page *grab_cache_page_write_begin(struct address_space *mapping, > > - pgoff_t index, unsigned flags) > > +struct page *__grab_cache_page_write_begin(struct address_space *mapping, > > + pgoff_t index, unsigned flags) > > { > > int status; > > struct page *page; > > @@ -2243,6 +2243,17 @@ repeat: > > } > > return page; > > } > > +struct page *grab_cache_page_write_begin(struct address_space *mapping, > > + pgoff_t index, unsigned flags) > > +{ > > + struct page *p; > > + > > + p = __grab_cache_page_write_begin(mapping, index, flags); > > + if (p) > > + wait_on_page_writeback(p); > > + > > + return p; > > +} > > EXPORT_SYMBOL(grab_cache_page_write_begin); > > Not much point in add in a wrapper when nothing else calls > __grab_cache_page_write_begin(), which should also be static.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-19 0:07 ` Darrick J. Wong @ 2011-03-19 2:28 ` Andreas Dilger 2011-03-22 19:23 ` Darrick J. Wong 2011-03-21 14:04 ` Jan Kara 1 sibling, 1 reply; 67+ messages in thread From: Andreas Dilger @ 2011-03-19 2:28 UTC (permalink / raw) To: djwong Cc: Dave Chinner, Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On 2011-03-18, at 6:07 PM, Darrick J. Wong wrote: >> Ok, here's what I have so far. I took everyone's suggestions of where to add >> calls to wait_on_page_writeback, which seems to handle the multiple-write case >> adequately. Unfortunately, it is still possible to generate checksum errors by >> scribbling furiously on a mmap'd region, even after adding the writeback wait >> in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by >> removing its wait_for_page_writeback call, so I suspect there's a bit more >> going on in btrfs than I've been able to figure out. > I wonder, is it possible for this to happen: > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > but there's no ongoing writeback, so it returns without delay. > 2. Thread A starts writing furiously to the page. > 3. Thread B runs fsync() or something that results in the page being > checksummed and scheduled for writeout. > 4. Thread A continues to write furiously(!) on that same page before the > controller finishes the DMA transfer. Right, page_mkwrite() is only called for the ro->rw transition. > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > > After letting the stress tool run for a few days, I can say fairly confidently > that the write() case doesn't seem to fail regardless of the O_DIRECT setting. > However, with writes to mmap regions, failures happen about once every 20-40 > minutes, even with O_DIRECT set. To me this suggests some sort of race > condition that we seem to win except once every 20 minutes. > > I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps > there's something that I could do just prior to calculating the DIF checksum > that would cause any subsequent write attempts to be shuffled back into > page_mkwrite. I tried the set_memory_ro thing again, though that led to some > recursive lock errors and I noticed that those functions only seem to exist in > arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling > messy, only seemed to corrupt memory. :) This seems like the best solution, IMHO, to ensure that mmap is blocked in page_mkwrite() before it has any chance to dirty the page undergoing checksum. The trick is that you need to set_page_writeback() before setting the page read-only, otherwise the race still exists. > Is there a "correct" way to take a writeable page and make it so that any > process trying to write to it ends up hitting the page fault handler where we > can then wait for writeback? Or perhaps I am simply barking up the wrong tree? > > (Just FYI I took the old copy-everything-to-bounce-buffers patch that few > people liked for a second spin, and the errors did not surface regardless of > what combination of write/mmap and directio/bufferedio I told it to use.) I wouldn't be so much against memcpy() for mmap pages, but it does seem kind of gross that mmap is forcing data copies when a major reason to use mmap is to AVOID data copies. >>> The set_memory_ro debugging trick didn't ferret out any write paths that I >>> didn't catch... though it did have the effect of causing occasional fsync() >>> deadlocks. I suppose I could sprinkle in a few more of those write calls to >>> see what happens. >>> >>> Either way, I'm emailing to ask everyone's advice since I've run out of ideas. >>> Or: Did I miss something? >>> >>> Thanks all for the feedback so far! >>> >>> -- >>> fs: Wait for page writeback when rewrite detected >>> >>> Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> >>> --- >>> >>> fs/buffer.c | 4 +++- >>> fs/ext4/inode.c | 3 +++ >>> mm/filemap.c | 15 +++++++++++++-- >>> 3 files changed, 19 insertions(+), 3 deletions(-) >>> >>> diff --git a/fs/buffer.c b/fs/buffer.c >>> index 2219a76..39e934c 100644 >>> --- a/fs/buffer.c >>> +++ b/fs/buffer.c >>> @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, >>> ret = VM_FAULT_OOM; >>> else /* -ENOSPC, -EIO, etc */ >>> ret = VM_FAULT_SIGBUS; >>> - } else >>> + } else { >>> + wait_on_page_writeback(page); >>> ret = VM_FAULT_LOCKED; >>> + } >> >> I think this needs to wait before the __block_write_begin() call, >> not after it. i.e. wait before the page is mapped, not afterwards. >> >> .... >>> diff --git a/mm/filemap.c b/mm/filemap.c >>> index 83a45d3..f201d80 100644 >>> --- a/mm/filemap.c >>> +++ b/mm/filemap.c >>> @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write); >>> * Find or create a page at the given pagecache position. Return the locked >>> * page. This function is specifically for buffered writes. >>> */ >>> -struct page *grab_cache_page_write_begin(struct address_space *mapping, >>> - pgoff_t index, unsigned flags) >>> +struct page *__grab_cache_page_write_begin(struct address_space *mapping, >>> + pgoff_t index, unsigned flags) >>> { >>> int status; >>> struct page *page; >>> @@ -2243,6 +2243,17 @@ repeat: >>> } >>> return page; >>> } >>> +struct page *grab_cache_page_write_begin(struct address_space *mapping, >>> + pgoff_t index, unsigned flags) >>> +{ >>> + struct page *p; >>> + >>> + p = __grab_cache_page_write_begin(mapping, index, flags); >>> + if (p) >>> + wait_on_page_writeback(p); >>> + >>> + return p; >>> +} >>> EXPORT_SYMBOL(grab_cache_page_write_begin); >> >> Not much point in add in a wrapper when nothing else calls >> __grab_cache_page_write_begin(), which should also be static.... >> >> Cheers, >> >> Dave. >> -- >> Dave Chinner >> david@fromorbit.com >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-19 2:28 ` Andreas Dilger @ 2011-03-22 19:23 ` Darrick J. Wong 2011-03-22 21:54 ` Jan Kara 0 siblings, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-03-22 19:23 UTC (permalink / raw) To: Andreas Dilger Cc: Dave Chinner, Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Fri, Mar 18, 2011 at 08:28:26PM -0600, Andreas Dilger wrote: > On 2011-03-18, at 6:07 PM, Darrick J. Wong wrote: > >> Ok, here's what I have so far. I took everyone's suggestions of where to add > >> calls to wait_on_page_writeback, which seems to handle the multiple-write case > >> adequately. Unfortunately, it is still possible to generate checksum errors by > >> scribbling furiously on a mmap'd region, even after adding the writeback wait > >> in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > >> removing its wait_for_page_writeback call, so I suspect there's a bit more > >> going on in btrfs than I've been able to figure out. > > > I wonder, is it possible for this to happen: > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > > but there's no ongoing writeback, so it returns without delay. > > 2. Thread A starts writing furiously to the page. > > 3. Thread B runs fsync() or something that results in the page being > > checksummed and scheduled for writeout. > > 4. Thread A continues to write furiously(!) on that same page before the > > controller finishes the DMA transfer. > > Right, page_mkwrite() is only called for the ro->rw transition. > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > > > > After letting the stress tool run for a few days, I can say fairly confidently > > that the write() case doesn't seem to fail regardless of the O_DIRECT setting. > > However, with writes to mmap regions, failures happen about once every 20-40 > > minutes, even with O_DIRECT set. To me this suggests some sort of race > > condition that we seem to win except once every 20 minutes. > > > > I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps > > there's something that I could do just prior to calculating the DIF checksum > > that would cause any subsequent write attempts to be shuffled back into > > page_mkwrite. I tried the set_memory_ro thing again, though that led to some > > recursive lock errors and I noticed that those functions only seem to exist in > > arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling > > messy, only seemed to corrupt memory. :) > > This seems like the best solution, IMHO, to ensure that mmap is blocked in > page_mkwrite() before it has any chance to dirty the page undergoing > checksum. The trick is that you need to set_page_writeback() before setting > the page read-only, otherwise the race still exists. I figured out that the recursive locking errors only happened in the set_memory_rw half of the ro/rw memory pair, and that I could make them go away (for now) by do set_memory_rw in the kintegrityd workqueue. Then I added a call to set_page_writeback just prior to the set_memory_ro call, though that resulted in a lot of complaints about invalid page states and the like. It would seem that the memory pages that arrive in bio_integrity_prep from jbd2 don't have the writeback flag set, and setting it causes problems for it. The writeback flag is set on all the pages that are associated with a checksum failure, I noticed. As for changing pte's around... does that set_memory_ro change the pte flags for all running processes? I'm not so sure it does for anything other than the current process. I think I saw a flush_tlb call in there... though I don't think it helps me much. If I /don't/ set the flag, the frequency of the errors decreases further to about once an hour, but I still see the occasional error. :/ Currently I'm trying to figure out how one might distinguish dirty pages that shouldn't have writeback set vs. pages that ought to have it but don't. I suppose I could pull out the 're-checksum and resubmit' patch I made a while back, though it seems like a bandaid. > > Is there a "correct" way to take a writeable page and make it so that any > > process trying to write to it ends up hitting the page fault handler where we > > can then wait for writeback? Or perhaps I am simply barking up the wrong tree? > > > > (Just FYI I took the old copy-everything-to-bounce-buffers patch that few > > people liked for a second spin, and the errors did not surface regardless of > > what combination of write/mmap and directio/bufferedio I told it to use.) > > I wouldn't be so much against memcpy() for mmap pages, but it does seem kind > of gross that mmap is forcing data copies when a major reason to use mmap is > to AVOID data copies. Yeah. We probably want to avoid having to find extra pages too. :( --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-22 19:23 ` Darrick J. Wong @ 2011-03-22 21:54 ` Jan Kara 0 siblings, 0 replies; 67+ messages in thread From: Jan Kara @ 2011-03-22 21:54 UTC (permalink / raw) To: Darrick J. Wong Cc: Andreas Dilger, Dave Chinner, Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Tue 22-03-11 12:23:01, Darrick J. Wong wrote: > On Fri, Mar 18, 2011 at 08:28:26PM -0600, Andreas Dilger wrote: > > This seems like the best solution, IMHO, to ensure that mmap is blocked in > > page_mkwrite() before it has any chance to dirty the page undergoing > > checksum. The trick is that you need to set_page_writeback() before setting > > the page read-only, otherwise the race still exists. > > I figured out that the recursive locking errors only happened in the > set_memory_rw half of the ro/rw memory pair, and that I could make them go away > (for now) by do set_memory_rw in the kintegrityd workqueue. Then I added a > call to set_page_writeback just prior to the set_memory_ro call, though that > resulted in a lot of complaints about invalid page states and the like. It > would seem that the memory pages that arrive in bio_integrity_prep from jbd2 > don't have the writeback flag set, and setting it causes problems for it. The > writeback flag is set on all the pages that are associated with a checksum > failure, I noticed. Yeah, pages submitted by jbd2 don't have writeback flag set because they are metadata blocks written directly via buffer heads. But as you noted, these are protected in a different way by the journalling layer so shouldn't need to worry. > As for changing pte's around... does that set_memory_ro change the pte flags > for all running processes? I'm not so sure it does for anything other than the > current process. I think I saw a flush_tlb call in there... though I don't > think it helps me much. > > If I /don't/ set the flag, the frequency of the errors decreases further to > about once an hour, but I still see the occasional error. :/ Currently I'm > trying to figure out how one might distinguish dirty pages that shouldn't have > writeback set vs. pages that ought to have it but don't. It's difficult at the block layer... If page->mapping->host is not a device inode, the page should have PageWriteback set. If it is a device inode, you don't know - JBD2 will submit pages without PageWriteback set, flusher thread will submit pages with PageWriteback set. And both is OK since we use buffer state for synchronization. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-19 0:07 ` Darrick J. Wong 2011-03-19 2:28 ` Andreas Dilger @ 2011-03-21 14:04 ` Jan Kara 2011-03-21 14:24 ` Chris Mason 1 sibling, 1 reply; 67+ messages in thread From: Jan Kara @ 2011-03-21 14:04 UTC (permalink / raw) To: Darrick J. Wong Cc: Dave Chinner, Chris Mason, Jan Kara, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Fri 18-03-11 17:07:55, Darrick J. Wong wrote: > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > > adequately. Unfortunately, it is still possible to generate checksum errors by > > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > > going on in btrfs than I've been able to figure out. > > I wonder, is it possible for this to happen: > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > but there's no ongoing writeback, so it returns without delay. > 2. Thread A starts writing furiously to the page. > 3. Thread B runs fsync() or something that results in the page being > checksummed and scheduled for writeout. > 4. Thread A continues to write furiously(!) on that same page before the > controller finishes the DMA transfer. > 5. Disk gets the page, which now doesn't match its checksum, and *boom* What happens on writepage (see mm/page-writeback.c:write_cache_pages()) is: lock_page(page) ... clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in PTE ... set_page_writeback() (happens e.g. in __block_write_full_page() called from filesystem's writepage implementation). unlock_page(page) So if you compute the checksum after set_page_writeback() is done in the writepage() implementation (you cannot use __block_write_full_page() in that case) and you call wait_on_page_writeback() in ext4_page_mkwrite() under page lock, you should be safe. If you do all this and still see errors, something is broken I'd say... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-21 14:04 ` Jan Kara @ 2011-03-21 14:24 ` Chris Mason 2011-03-21 16:43 ` Jan Kara 0 siblings, 1 reply; 67+ messages in thread From: Chris Mason @ 2011-03-21 14:24 UTC (permalink / raw) To: Jan Kara Cc: Darrick J. Wong, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400: > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote: > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > > > adequately. Unfortunately, it is still possible to generate checksum errors by > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > > > going on in btrfs than I've been able to figure out. > > > > I wonder, is it possible for this to happen: > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > > but there's no ongoing writeback, so it returns without delay. > > 2. Thread A starts writing furiously to the page. > > 3. Thread B runs fsync() or something that results in the page being > > checksummed and scheduled for writeout. > > 4. Thread A continues to write furiously(!) on that same page before the > > controller finishes the DMA transfer. > > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > What happens on writepage (see mm/page-writeback.c:write_cache_pages()) > is: > lock_page(page) > ... > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in > PTE > ... > set_page_writeback() (happens e.g. in __block_write_full_page() called > from filesystem's writepage implementation). > unlock_page(page) > > So if you compute the checksum after set_page_writeback() is done in the > writepage() implementation (you cannot use __block_write_full_page() in > that case) and you call wait_on_page_writeback() in ext4_page_mkwrite() > under page lock, you should be safe. If you do all this and still see > errors, something is broken I'd say... Looking at the ext4_page_mkwrite, it does this: lock the page check for holes unlock the page if (no_holes) return; write_begin/write_end return So, to have page_mkwrite work, you need to wait for writeback with the page locked in both the no holes case and after the write_begin/write_end. write_begin will dirty the page, so someone can wander in and start the IO while we are still in page_mkwrite. This is untested and uncompiled, but it should do the trick. Jan, did you get rid of all the buffer head based writeback for data=ordered in ext4? That's my only other idea, that someone is doing writeback directly without taking the page lock. diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9f7f9e4..8a75e12 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (page_has_buffers(page)) { if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { + wait_on_page_writeback(page); unlock_page(page); goto out_unlock; } @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (ret < 0) goto out_unlock; ret = 0; + + /* + * write_begin/end might have created a dirty page and someone + * could wander in and start the IO. Make sure that hasn't + * happened + */ + lock_page(page); + wait_on_page_writeback(page); + unlock_page(page); + out_unlock: if (ret) ret = VM_FAULT_SIGBUS; ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-21 14:24 ` Chris Mason @ 2011-03-21 16:43 ` Jan Kara 2011-04-06 23:29 ` Darrick J. Wong 0 siblings, 1 reply; 67+ messages in thread From: Jan Kara @ 2011-03-21 16:43 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Darrick J. Wong, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Mon 21-03-11 10:24:41, Chris Mason wrote: > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400: > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote: > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > > > > going on in btrfs than I've been able to figure out. > > > > > > I wonder, is it possible for this to happen: > > > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > > > but there's no ongoing writeback, so it returns without delay. > > > 2. Thread A starts writing furiously to the page. > > > 3. Thread B runs fsync() or something that results in the page being > > > checksummed and scheduled for writeout. > > > 4. Thread A continues to write furiously(!) on that same page before the > > > controller finishes the DMA transfer. > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > > What happens on writepage (see mm/page-writeback.c:write_cache_pages()) > > is: > > lock_page(page) > > ... > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in > > PTE > > ... > > set_page_writeback() (happens e.g. in __block_write_full_page() called > > from filesystem's writepage implementation). > > unlock_page(page) > > > > So if you compute the checksum after set_page_writeback() is done in the > > writepage() implementation (you cannot use __block_write_full_page() in > > that case) I should add that if you are computing the checksum in the block layer once the bio is submitted, you obviously are computing it after the page is marked as writeback. So that should be fine... > > and you call wait_on_page_writeback() in ext4_page_mkwrite() > > under page lock, you should be safe. If you do all this and still see > > errors, something is broken I'd say... > > Looking at the ext4_page_mkwrite, it does this: > > lock the page > check for holes > unlock the page > if (no_holes) > return; > > write_begin/write_end > return > > So, to have page_mkwrite work, you need to wait for writeback with the > page locked in both the no holes case and after the > write_begin/write_end. write_begin will dirty the page, so someone can > wander in and start the IO while we are still in page_mkwrite. Oh right, that's a good point. > This is untested and uncompiled, but it should > do the trick. > > Jan, did you get rid of all the buffer head based writeback for > data=ordered in ext4? That's my only other idea, that someone is doing > writeback directly without taking the page lock. Yes, ext4 shouldn't do any buffer based writeback. > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 9f7f9e4..8a75e12 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > if (page_has_buffers(page)) { > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, > ext4_bh_unmapped)) { > + wait_on_page_writeback(page); > unlock_page(page); > goto out_unlock; > } > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > if (ret < 0) > goto out_unlock; > ret = 0; > + > + /* > + * write_begin/end might have created a dirty page and someone > + * could wander in and start the IO. Make sure that hasn't > + * happened > + */ > + lock_page(page); > + wait_on_page_writeback(page); > + unlock_page(page); > + > out_unlock: > if (ret) > ret = VM_FAULT_SIGBUS; > This looks good AFAICT. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-03-21 16:43 ` Jan Kara @ 2011-04-06 23:29 ` Darrick J. Wong 2011-04-07 16:44 ` Darrick J. Wong 2011-04-07 16:57 ` Jan Kara 0 siblings, 2 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-04-06 23:29 UTC (permalink / raw) To: Jan Kara Cc: Chris Mason, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote: > On Mon 21-03-11 10:24:41, Chris Mason wrote: > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400: > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote: > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > > > > > going on in btrfs than I've been able to figure out. > > > > > > > > I wonder, is it possible for this to happen: > > > > > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > > > > but there's no ongoing writeback, so it returns without delay. > > > > 2. Thread A starts writing furiously to the page. > > > > 3. Thread B runs fsync() or something that results in the page being > > > > checksummed and scheduled for writeout. > > > > 4. Thread A continues to write furiously(!) on that same page before the > > > > controller finishes the DMA transfer. > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages()) > > > is: > > > lock_page(page) > > > ... > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in > > > PTE > > > ... > > > set_page_writeback() (happens e.g. in __block_write_full_page() called > > > from filesystem's writepage implementation). > > > unlock_page(page) > > > > > > So if you compute the checksum after set_page_writeback() is done in the > > > writepage() implementation (you cannot use __block_write_full_page() in > > > that case) > I should add that if you are computing the checksum in the block layer > once the bio is submitted, you obviously are computing it after the page is > marked as writeback. So that should be fine... > > > > and you call wait_on_page_writeback() in ext4_page_mkwrite() > > > under page lock, you should be safe. If you do all this and still see > > > errors, something is broken I'd say... > > > > Looking at the ext4_page_mkwrite, it does this: > > > > lock the page > > check for holes > > unlock the page > > if (no_holes) > > return; > > > > write_begin/write_end > > return > > > > So, to have page_mkwrite work, you need to wait for writeback with the > > page locked in both the no holes case and after the > > write_begin/write_end. write_begin will dirty the page, so someone can > > wander in and start the IO while we are still in page_mkwrite. > Oh right, that's a good point. > > > This is untested and uncompiled, but it should > > do the trick. > > > > Jan, did you get rid of all the buffer head based writeback for > > data=ordered in ext4? That's my only other idea, that someone is doing > > writeback directly without taking the page lock. > Yes, ext4 shouldn't do any buffer based writeback. > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index 9f7f9e4..8a75e12 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > if (page_has_buffers(page)) { > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, > > ext4_bh_unmapped)) { > > + wait_on_page_writeback(page); > > unlock_page(page); > > goto out_unlock; > > } > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > if (ret < 0) > > goto out_unlock; > > ret = 0; > > + > > + /* > > + * write_begin/end might have created a dirty page and someone > > + * could wander in and start the IO. Make sure that hasn't > > + * happened > > + */ > > + lock_page(page); > > + wait_on_page_writeback(page); > > + unlock_page(page); > > + > > out_unlock: > > if (ret) > > ret = VM_FAULT_SIGBUS; > > > This looks good AFAICT. I gave this a spin a couple of weeks ago (and accidentally left the test machines running for a full week!) From what I can tell, with all the various wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors down to about 7-8 per day. Not bad, but not fixed. On the odd chance that jbd2 really can provide stable pages during writeback, I am now rerunning the test with no patches and data=journal, while noting that (a) DIO mode doesn't work with data=journal and (b) the first write failure will probably cause the journal to abort == game over. When that's done I'll give the wait-for-writeback patches a whirl with 2.6.39-rc. Enjoy the warm SF weather! --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-06 23:29 ` Darrick J. Wong @ 2011-04-07 16:44 ` Darrick J. Wong 2011-04-07 16:57 ` Jan Kara 1 sibling, 0 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-04-07 16:44 UTC (permalink / raw) To: Jan Kara Cc: Chris Mason, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Wed, Apr 06, 2011 at 04:29:38PM -0700, Darrick J. Wong wrote: > On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote: > > On Mon 21-03-11 10:24:41, Chris Mason wrote: > > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400: > > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote: > > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by > > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > > > > > > going on in btrfs than I've been able to figure out. > > > > > > > > > > I wonder, is it possible for this to happen: > > > > > > > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > > > > > but there's no ongoing writeback, so it returns without delay. > > > > > 2. Thread A starts writing furiously to the page. > > > > > 3. Thread B runs fsync() or something that results in the page being > > > > > checksummed and scheduled for writeout. > > > > > 4. Thread A continues to write furiously(!) on that same page before the > > > > > controller finishes the DMA transfer. > > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages()) > > > > is: > > > > lock_page(page) > > > > ... > > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in > > > > PTE > > > > ... > > > > set_page_writeback() (happens e.g. in __block_write_full_page() called > > > > from filesystem's writepage implementation). > > > > unlock_page(page) > > > > > > > > So if you compute the checksum after set_page_writeback() is done in the > > > > writepage() implementation (you cannot use __block_write_full_page() in > > > > that case) > > I should add that if you are computing the checksum in the block layer > > once the bio is submitted, you obviously are computing it after the page is > > marked as writeback. So that should be fine... > > > > > > and you call wait_on_page_writeback() in ext4_page_mkwrite() > > > > under page lock, you should be safe. If you do all this and still see > > > > errors, something is broken I'd say... > > > > > > Looking at the ext4_page_mkwrite, it does this: > > > > > > lock the page > > > check for holes > > > unlock the page > > > if (no_holes) > > > return; > > > > > > write_begin/write_end > > > return > > > > > > So, to have page_mkwrite work, you need to wait for writeback with the > > > page locked in both the no holes case and after the > > > write_begin/write_end. write_begin will dirty the page, so someone can > > > wander in and start the IO while we are still in page_mkwrite. > > Oh right, that's a good point. > > > > > This is untested and uncompiled, but it should > > > do the trick. > > > > > > Jan, did you get rid of all the buffer head based writeback for > > > data=ordered in ext4? That's my only other idea, that someone is doing > > > writeback directly without taking the page lock. > > Yes, ext4 shouldn't do any buffer based writeback. > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > > index 9f7f9e4..8a75e12 100644 > > > --- a/fs/ext4/inode.c > > > +++ b/fs/ext4/inode.c > > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > if (page_has_buffers(page)) { > > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, > > > ext4_bh_unmapped)) { > > > + wait_on_page_writeback(page); > > > unlock_page(page); > > > goto out_unlock; > > > } > > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > if (ret < 0) > > > goto out_unlock; > > > ret = 0; > > > + > > > + /* > > > + * write_begin/end might have created a dirty page and someone > > > + * could wander in and start the IO. Make sure that hasn't > > > + * happened > > > + */ > > > + lock_page(page); > > > + wait_on_page_writeback(page); > > > + unlock_page(page); > > > + > > > out_unlock: > > > if (ret) > > > ret = VM_FAULT_SIGBUS; > > > > > This looks good AFAICT. > > I gave this a spin a couple of weeks ago (and accidentally left the test > machines running for a full week!) From what I can tell, with all the various > wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors > down to about 7-8 per day. Not bad, but not fixed. > > On the odd chance that jbd2 really can provide stable pages during writeback, I > am now rerunning the test with no patches and data=journal, while noting that > (a) DIO mode doesn't work with data=journal and (b) the first write failure > will probably cause the journal to abort == game over. When that's done I'll Heh, nope, even with data=journal I still see checksum errors. So much for that hare-brained theory. :( --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-06 23:29 ` Darrick J. Wong 2011-04-07 16:44 ` Darrick J. Wong @ 2011-04-07 16:57 ` Jan Kara 2011-04-08 20:31 ` Darrick J. Wong 1 sibling, 1 reply; 67+ messages in thread From: Jan Kara @ 2011-04-07 16:57 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, Chris Mason, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Wed 06-04-11 16:29:38, Darrick J. Wong wrote: > On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote: > > On Mon 21-03-11 10:24:41, Chris Mason wrote: > > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400: > > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote: > > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by > > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > > > > > > going on in btrfs than I've been able to figure out. > > > > > > > > > > I wonder, is it possible for this to happen: > > > > > > > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > > > > > but there's no ongoing writeback, so it returns without delay. > > > > > 2. Thread A starts writing furiously to the page. > > > > > 3. Thread B runs fsync() or something that results in the page being > > > > > checksummed and scheduled for writeout. > > > > > 4. Thread A continues to write furiously(!) on that same page before the > > > > > controller finishes the DMA transfer. > > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages()) > > > > is: > > > > lock_page(page) > > > > ... > > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in > > > > PTE > > > > ... > > > > set_page_writeback() (happens e.g. in __block_write_full_page() called > > > > from filesystem's writepage implementation). > > > > unlock_page(page) > > > > > > > > So if you compute the checksum after set_page_writeback() is done in the > > > > writepage() implementation (you cannot use __block_write_full_page() in > > > > that case) > > I should add that if you are computing the checksum in the block layer > > once the bio is submitted, you obviously are computing it after the page is > > marked as writeback. So that should be fine... > > > > > > and you call wait_on_page_writeback() in ext4_page_mkwrite() > > > > under page lock, you should be safe. If you do all this and still see > > > > errors, something is broken I'd say... > > > > > > Looking at the ext4_page_mkwrite, it does this: > > > > > > lock the page > > > check for holes > > > unlock the page > > > if (no_holes) > > > return; > > > > > > write_begin/write_end > > > return > > > > > > So, to have page_mkwrite work, you need to wait for writeback with the > > > page locked in both the no holes case and after the > > > write_begin/write_end. write_begin will dirty the page, so someone can > > > wander in and start the IO while we are still in page_mkwrite. > > Oh right, that's a good point. > > > > > This is untested and uncompiled, but it should > > > do the trick. > > > > > > Jan, did you get rid of all the buffer head based writeback for > > > data=ordered in ext4? That's my only other idea, that someone is doing > > > writeback directly without taking the page lock. > > Yes, ext4 shouldn't do any buffer based writeback. > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > > index 9f7f9e4..8a75e12 100644 > > > --- a/fs/ext4/inode.c > > > +++ b/fs/ext4/inode.c > > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > if (page_has_buffers(page)) { > > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, > > > ext4_bh_unmapped)) { > > > + wait_on_page_writeback(page); > > > unlock_page(page); > > > goto out_unlock; > > > } > > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > if (ret < 0) > > > goto out_unlock; > > > ret = 0; > > > + > > > + /* > > > + * write_begin/end might have created a dirty page and someone > > > + * could wander in and start the IO. Make sure that hasn't > > > + * happened > > > + */ > > > + lock_page(page); > > > + wait_on_page_writeback(page); > > > + unlock_page(page); > > > + > > > out_unlock: > > > if (ret) > > > ret = VM_FAULT_SIGBUS; > > > > > This looks good AFAICT. > > I gave this a spin a couple of weeks ago (and accidentally left the test > machines running for a full week!) From what I can tell, with all the various > wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors > down to about 7-8 per day. Not bad, but not fixed. Ugh, strange. Can you post the full patch you are currently using? I've already lost track of all the proposed changes... Thanks. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-07 16:57 ` Jan Kara @ 2011-04-08 20:31 ` Darrick J. Wong 2011-04-11 16:42 ` Jeff Layton 0 siblings, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-04-08 20:31 UTC (permalink / raw) To: Jan Kara Cc: Chris Mason, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Thu, Apr 07, 2011 at 06:57:00PM +0200, Jan Kara wrote: > On Wed 06-04-11 16:29:38, Darrick J. Wong wrote: > > On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote: > > > On Mon 21-03-11 10:24:41, Chris Mason wrote: > > > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400: > > > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote: > > > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by > > > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > > > > > > > going on in btrfs than I've been able to figure out. > > > > > > > > > > > > I wonder, is it possible for this to happen: > > > > > > > > > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > > > > > > but there's no ongoing writeback, so it returns without delay. > > > > > > 2. Thread A starts writing furiously to the page. > > > > > > 3. Thread B runs fsync() or something that results in the page being > > > > > > checksummed and scheduled for writeout. > > > > > > 4. Thread A continues to write furiously(!) on that same page before the > > > > > > controller finishes the DMA transfer. > > > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > > > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages()) > > > > > is: > > > > > lock_page(page) > > > > > ... > > > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in > > > > > PTE > > > > > ... > > > > > set_page_writeback() (happens e.g. in __block_write_full_page() called > > > > > from filesystem's writepage implementation). > > > > > unlock_page(page) > > > > > > > > > > So if you compute the checksum after set_page_writeback() is done in the > > > > > writepage() implementation (you cannot use __block_write_full_page() in > > > > > that case) > > > I should add that if you are computing the checksum in the block layer > > > once the bio is submitted, you obviously are computing it after the page is > > > marked as writeback. So that should be fine... > > > > > > > > and you call wait_on_page_writeback() in ext4_page_mkwrite() > > > > > under page lock, you should be safe. If you do all this and still see > > > > > errors, something is broken I'd say... > > > > > > > > Looking at the ext4_page_mkwrite, it does this: > > > > > > > > lock the page > > > > check for holes > > > > unlock the page > > > > if (no_holes) > > > > return; > > > > > > > > write_begin/write_end > > > > return > > > > > > > > So, to have page_mkwrite work, you need to wait for writeback with the > > > > page locked in both the no holes case and after the > > > > write_begin/write_end. write_begin will dirty the page, so someone can > > > > wander in and start the IO while we are still in page_mkwrite. > > > Oh right, that's a good point. > > > > > > > This is untested and uncompiled, but it should > > > > do the trick. > > > > > > > > Jan, did you get rid of all the buffer head based writeback for > > > > data=ordered in ext4? That's my only other idea, that someone is doing > > > > writeback directly without taking the page lock. > > > Yes, ext4 shouldn't do any buffer based writeback. > > > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > > > index 9f7f9e4..8a75e12 100644 > > > > --- a/fs/ext4/inode.c > > > > +++ b/fs/ext4/inode.c > > > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > > if (page_has_buffers(page)) { > > > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, > > > > ext4_bh_unmapped)) { > > > > + wait_on_page_writeback(page); > > > > unlock_page(page); > > > > goto out_unlock; > > > > } > > > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > > if (ret < 0) > > > > goto out_unlock; > > > > ret = 0; > > > > + > > > > + /* > > > > + * write_begin/end might have created a dirty page and someone > > > > + * could wander in and start the IO. Make sure that hasn't > > > > + * happened > > > > + */ > > > > + lock_page(page); > > > > + wait_on_page_writeback(page); > > > > + unlock_page(page); > > > > + > > > > out_unlock: > > > > if (ret) > > > > ret = VM_FAULT_SIGBUS; > > > > > > > This looks good AFAICT. > > > > I gave this a spin a couple of weeks ago (and accidentally left the test > > machines running for a full week!) From what I can tell, with all the various > > wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors > > down to about 7-8 per day. Not bad, but not fixed. > Ugh, strange. Can you post the full patch you are currently using? I've > already lost track of all the proposed changes... Thanks. Yes, me too. Attached is the giant patch I've been working on. --D fs: Wait for page writeback when rewrite detected, and mark pages ro during writeback Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- diff --git a/fs/buffer.c b/fs/buffer.c index a08bb8e..dd429fe 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2357,6 +2357,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, else end = PAGE_CACHE_SIZE; + WARN_ON(!PageLocked(page)); + wait_on_page_writeback(page); ret = __block_write_begin(page, 0, end, get_block); if (!ret) ret = block_commit_write(page, 0, end); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 1a86282..57cd028 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2666,6 +2666,8 @@ static int ext4_writepage(struct page *page, */ goto redirty_page; } + WARN_ON(!PageLocked(page)); + wait_on_page_writeback(page); if (commit_write) /* now mark the buffer_heads as dirty and uptodate */ block_commit_write(page, 0, len); @@ -2778,7 +2780,8 @@ static int write_cache_pages_da(struct address_space *mapping, } lock_page(page); - + if (PageWriteback(page)) + wait_on_page_writeback(page); /* * If the page is no longer dirty, or its * mapping no longer corresponds to inode we @@ -5803,12 +5806,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (PageMappedToDisk(page)) goto out_unlock; + lock_page(page); + /* this one seems to handle mmap */ + wait_on_page_writeback(page); if (page->index == size >> PAGE_CACHE_SHIFT) len = size & ~PAGE_CACHE_MASK; else len = PAGE_CACHE_SIZE; - lock_page(page); /* * return if we have all the buffers mapped. This avoid * the need to call write_begin/write_end which does a @@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (ret < 0) goto out_unlock; ret = 0; + + /* + * write_begin/end might have created a dirty page and someone + * could wander in and start the IO. Make sure that hasn't + * happened. + */ + lock_page(page); + wait_on_page_writeback(page); + unlock_page(page); out_unlock: if (ret) ret = VM_FAULT_SIGBUS; diff --git a/mm/filemap.c b/mm/filemap.c index c641edf..3ed13a3 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2277,8 +2277,9 @@ EXPORT_SYMBOL(generic_file_direct_write); * Find or create a page at the given pagecache position. Return the locked * page. This function is specifically for buffered writes. */ -struct page *grab_cache_page_write_begin(struct address_space *mapping, - pgoff_t index, unsigned flags) +static struct page * +__grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, + unsigned flags) { int status; struct page *page; @@ -2303,6 +2304,20 @@ repeat: } return page; } + +struct page *grab_cache_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags) +{ + struct page *p; + + p = __grab_cache_page_write_begin(mapping, index, flags); + if (p) { + WARN_ON(!PageLocked(p)); + wait_on_page_writeback(p); + } + + return p; +} EXPORT_SYMBOL(grab_cache_page_write_begin); static ssize_t generic_perform_write(struct file *file, diff --git a/mm/memory.c b/mm/memory.c index 9da8cab..17fb560 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3146,6 +3146,9 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, } else VM_BUG_ON(!PageLocked(page)); page_mkwrite = 1; + } else { + WARN_ON(!PageLocked(page)); + wait_on_page_writeback(page); } } diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c index 9c5e6b2..7d2c57d 100644 --- a/fs/bio-integrity.c +++ b/fs/bio-integrity.c @@ -25,6 +25,7 @@ #include <linux/bio.h> #include <linux/workqueue.h> #include <linux/slab.h> +#include <asm/tlbflush.h> struct integrity_slab { struct kmem_cache *slab; @@ -382,6 +383,52 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi) return 0; } +static int set_page_ro(struct page *page) +{ + int set_memory_ro(unsigned long addr, int numpages); + unsigned long addr = (unsigned long)page_address(page); + if (PageHighMem(page)) + return 0; + return set_memory_ro(addr, 1); +} + +static int set_page_rw(struct page *page) +{ + int set_memory_rw(unsigned long addr, int numpages); + unsigned long addr = (unsigned long)page_address(page); + if (PageHighMem(page)) + return 0; + return set_memory_rw(addr, 1); +} + +static void bio_integrity_write_fn(struct work_struct *work) +{ + struct bio_integrity_payload *bip = + container_of(work, struct bio_integrity_payload, bip_work); + struct bio *bio = bip->bip_bio; + struct bio_vec *from; + int i; + + __bio_for_each_segment(from, bio, i, 0) { + set_page_rw(from->bv_page); + } + + /* Restore original bio completion handler */ + bio->bi_end_io = bip->bip_end_io; + bio_endio(bio, bip->bip_error); +} + +static void bio_integrity_endio_write(struct bio *bio, int error) +{ + struct bio_integrity_payload *bip = bio->bi_integrity; + + BUG_ON(bip->bip_bio != bio); + + bip->bip_error = error; + INIT_WORK(&bip->bip_work, bio_integrity_write_fn); + queue_work(kintegrityd_wq, &bip->bip_work); +} + /** * bio_integrity_prep - Prepare bio for integrity I/O * @bio: bio to prepare @@ -468,8 +515,30 @@ int bio_integrity_prep(struct bio *bio) } /* Auto-generate integrity metadata if this is a write */ - if (bio_data_dir(bio) == WRITE) + if (bio_data_dir(bio) == WRITE) { + struct bio_vec *from; + int i; + + bip->bip_end_io = bio->bi_end_io; + bio->bi_end_io = bio_integrity_endio_write; + __bio_for_each_segment(from, bio, i, 0) { + set_page_writeback(from->bv_page); +#if 0 + if (!PagePrivate(from->bv_page) && + !PageWriteback(from->bv_page) && + from->bv_page->mapping && + from->bv_page->mapping->host && + !from->bv_page->mapping->host->i_bdev) + { + printk(KERN_ERR "%s: writebacking file(?) page=%p flags=%lx mode=%x...\n", __FUNCTION__, from->bv_page, from->bv_page->flags, from->bv_page->mapping->host->i_mode); + set_page_writeback(from->bv_page); + } +#endif + set_page_ro(from->bv_page); + flush_tlb_all(); + } bio_integrity_generate(bio); + } return 0; } diff --git a/fs/buffer.c b/fs/buffer.c index dd429fe..02156c5 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -376,8 +376,10 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate) if (!quiet_error(bh)) { buffer_io_error(bh); printk(KERN_WARNING "lost page write due to " - "I/O error on %s\n", - bdevname(bh->b_bdev, b)); + "I/O error on %s, flags=%lx\n", + bdevname(bh->b_bdev, b), page->flags); +if (page->mapping && page->mapping->host) +printk(KERN_ERR "mode=%x inode=%d dev=%d\n", page->mapping->host->i_mode, page->mapping->host->i_ino, (page->mapping->host->i_rdev)); } set_bit(AS_EIO, &page->mapping->flags); set_buffer_write_io_error(bh); diff --git a/include/linux/bio.h b/include/linux/bio.h index ce33e68..ea36e89 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -180,6 +180,7 @@ struct bio_integrity_payload { unsigned short bip_vcnt; /* # of integrity bio_vecs */ unsigned short bip_idx; /* current bip_vec index */ + int bip_error; /* bio completion status? */ struct work_struct bip_work; /* I/O completion */ struct bio_vec bip_vec[0]; /* embedded bvec array */ }; ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-08 20:31 ` Darrick J. Wong @ 2011-04-11 16:42 ` Jeff Layton 2011-04-11 17:41 ` Chris Mason 0 siblings, 1 reply; 67+ messages in thread From: Jeff Layton @ 2011-04-11 16:42 UTC (permalink / raw) To: djwong Cc: Jan Kara, Chris Mason, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Fri, 8 Apr 2011 13:31:35 -0700 "Darrick J. Wong" <djwong@us.ibm.com> wrote: > On Thu, Apr 07, 2011 at 06:57:00PM +0200, Jan Kara wrote: > > On Wed 06-04-11 16:29:38, Darrick J. Wong wrote: > > > On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote: > > > > On Mon 21-03-11 10:24:41, Chris Mason wrote: > > > > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400: > > > > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote: > > > > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add > > > > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case > > > > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by > > > > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait > > > > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by > > > > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more > > > > > > > > > going on in btrfs than I've been able to figure out. > > > > > > > > > > > > > > I wonder, is it possible for this to happen: > > > > > > > > > > > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes, > > > > > > > but there's no ongoing writeback, so it returns without delay. > > > > > > > 2. Thread A starts writing furiously to the page. > > > > > > > 3. Thread B runs fsync() or something that results in the page being > > > > > > > checksummed and scheduled for writeout. > > > > > > > 4. Thread A continues to write furiously(!) on that same page before the > > > > > > > controller finishes the DMA transfer. > > > > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom* > > > > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages()) > > > > > > is: > > > > > > lock_page(page) > > > > > > ... > > > > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in > > > > > > PTE > > > > > > ... > > > > > > set_page_writeback() (happens e.g. in __block_write_full_page() called > > > > > > from filesystem's writepage implementation). > > > > > > unlock_page(page) > > > > > > > > > > > > So if you compute the checksum after set_page_writeback() is done in the > > > > > > writepage() implementation (you cannot use __block_write_full_page() in > > > > > > that case) > > > > I should add that if you are computing the checksum in the block layer > > > > once the bio is submitted, you obviously are computing it after the page is > > > > marked as writeback. So that should be fine... > > > > > > > > > > and you call wait_on_page_writeback() in ext4_page_mkwrite() > > > > > > under page lock, you should be safe. If you do all this and still see > > > > > > errors, something is broken I'd say... > > > > > > > > > > Looking at the ext4_page_mkwrite, it does this: > > > > > > > > > > lock the page > > > > > check for holes > > > > > unlock the page > > > > > if (no_holes) > > > > > return; > > > > > > > > > > write_begin/write_end > > > > > return > > > > > > > > > > So, to have page_mkwrite work, you need to wait for writeback with the > > > > > page locked in both the no holes case and after the > > > > > write_begin/write_end. write_begin will dirty the page, so someone can > > > > > wander in and start the IO while we are still in page_mkwrite. > > > > Oh right, that's a good point. > > > > > > > > > This is untested and uncompiled, but it should > > > > > do the trick. > > > > > > > > > > Jan, did you get rid of all the buffer head based writeback for > > > > > data=ordered in ext4? That's my only other idea, that someone is doing > > > > > writeback directly without taking the page lock. > > > > Yes, ext4 shouldn't do any buffer based writeback. > > > > > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > > > > index 9f7f9e4..8a75e12 100644 > > > > > --- a/fs/ext4/inode.c > > > > > +++ b/fs/ext4/inode.c > > > > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > > > if (page_has_buffers(page)) { > > > > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, > > > > > ext4_bh_unmapped)) { > > > > > + wait_on_page_writeback(page); > > > > > unlock_page(page); > > > > > goto out_unlock; > > > > > } > > > > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > > > if (ret < 0) > > > > > goto out_unlock; > > > > > ret = 0; > > > > > + > > > > > + /* > > > > > + * write_begin/end might have created a dirty page and someone > > > > > + * could wander in and start the IO. Make sure that hasn't > > > > > + * happened > > > > > + */ > > > > > + lock_page(page); > > > > > + wait_on_page_writeback(page); > > > > > + unlock_page(page); > > > > > + > > > > > out_unlock: > > > > > if (ret) > > > > > ret = VM_FAULT_SIGBUS; > > > > > > > > > This looks good AFAICT. > > > > > > I gave this a spin a couple of weeks ago (and accidentally left the test > > > machines running for a full week!) From what I can tell, with all the various > > > wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors > > > down to about 7-8 per day. Not bad, but not fixed. > > Ugh, strange. Can you post the full patch you are currently using? I've > > already lost track of all the proposed changes... Thanks. > > Yes, me too. Attached is the giant patch I've been working on. > > --D > > fs: Wait for page writeback when rewrite detected, and mark pages ro during writeback > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > --- > diff --git a/fs/buffer.c b/fs/buffer.c > index a08bb8e..dd429fe 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2357,6 +2357,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > else > end = PAGE_CACHE_SIZE; > > + WARN_ON(!PageLocked(page)); > + wait_on_page_writeback(page); > ret = __block_write_begin(page, 0, end, get_block); > if (!ret) > ret = block_commit_write(page, 0, end); > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 1a86282..57cd028 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2666,6 +2666,8 @@ static int ext4_writepage(struct page *page, > */ > goto redirty_page; > } > + WARN_ON(!PageLocked(page)); > + wait_on_page_writeback(page); > if (commit_write) > /* now mark the buffer_heads as dirty and uptodate */ > block_commit_write(page, 0, len); > @@ -2778,7 +2780,8 @@ static int write_cache_pages_da(struct address_space *mapping, > } > > lock_page(page); > - > + if (PageWriteback(page)) > + wait_on_page_writeback(page); > /* > * If the page is no longer dirty, or its > * mapping no longer corresponds to inode we > @@ -5803,12 +5806,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > if (PageMappedToDisk(page)) > goto out_unlock; > > + lock_page(page); > + /* this one seems to handle mmap */ > + wait_on_page_writeback(page); > if (page->index == size >> PAGE_CACHE_SHIFT) > len = size & ~PAGE_CACHE_MASK; > else > len = PAGE_CACHE_SIZE; > > - lock_page(page); > /* > * return if we have all the buffers mapped. This avoid > * the need to call write_begin/write_end which does a > @@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > if (ret < 0) > goto out_unlock; > ret = 0; > + > + /* > + * write_begin/end might have created a dirty page and someone > + * could wander in and start the IO. Make sure that hasn't > + * happened. > + */ > + lock_page(page); > + wait_on_page_writeback(page); > + unlock_page(page); nit: The callers of page_mkwrite always lock the page afterward if you return from page_mkwrite with it unlocked. If you plan to take page lock anyway, it's probably slightly more efficient not to unlock it and instead return VM_FAULT_LOCKED. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-11 16:42 ` Jeff Layton @ 2011-04-11 17:41 ` Chris Mason 2011-04-11 18:25 ` Christoph Hellwig 2011-04-12 0:46 ` Mingming Cao 0 siblings, 2 replies; 67+ messages in thread From: Chris Mason @ 2011-04-11 17:41 UTC (permalink / raw) To: Jeff Layton Cc: djwong, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Excerpts from Jeff Layton's message of 2011-04-11 12:42:29 -0400: > > @@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > if (ret < 0) > > goto out_unlock; > > ret = 0; > > + > > + /* > > + * write_begin/end might have created a dirty page and someone > > + * could wander in and start the IO. Make sure that hasn't > > + * happened. > > + */ > > + lock_page(page); > > + wait_on_page_writeback(page); > > + unlock_page(page); > > nit: > > The callers of page_mkwrite always lock the page afterward if you > return from page_mkwrite with it unlocked. If you plan to take page > lock anyway, it's probably slightly more efficient not to unlock it and > instead return VM_FAULT_LOCKED. > Actually this isn't a nit. Keeping the page locked closes an important hole where it can become writeback again. It might fix the last remaining problem. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-11 17:41 ` Chris Mason @ 2011-04-11 18:25 ` Christoph Hellwig 2011-04-11 18:38 ` Chris Mason 2011-04-12 0:46 ` Mingming Cao 1 sibling, 1 reply; 67+ messages in thread From: Christoph Hellwig @ 2011-04-11 18:25 UTC (permalink / raw) To: Chris Mason Cc: Jeff Layton, djwong, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Mon, Apr 11, 2011 at 01:41:16PM -0400, Chris Mason wrote: > Actually this isn't a nit. Keeping the page locked closes an important > hole where it can become writeback again. It might fix the last > remaining problem. Note that the generic block_page_mkwrite atually does that. But only xfs and nilfs2 actually make use of it. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-11 18:25 ` Christoph Hellwig @ 2011-04-11 18:38 ` Chris Mason 0 siblings, 0 replies; 67+ messages in thread From: Chris Mason @ 2011-04-11 18:38 UTC (permalink / raw) To: Christoph Hellwig Cc: Jeff Layton, djwong, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi Excerpts from Christoph Hellwig's message of 2011-04-11 14:25:41 -0400: > On Mon, Apr 11, 2011 at 01:41:16PM -0400, Chris Mason wrote: > > Actually this isn't a nit. Keeping the page locked closes an important > > hole where it can become writeback again. It might fix the last > > remaining problem. > > Note that the generic block_page_mkwrite atually does that. But only > xfs and nilfs2 actually make use of it. > Btrfs does return VM_FAULT_LOCKED though. I had forgotten to look for this in the ext4 page_mkwrite call. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-11 17:41 ` Chris Mason 2011-04-11 18:25 ` Christoph Hellwig @ 2011-04-12 0:46 ` Mingming Cao 2011-04-12 0:57 ` Christoph Hellwig 1 sibling, 1 reply; 67+ messages in thread From: Mingming Cao @ 2011-04-12 0:46 UTC (permalink / raw) To: Chris Mason Cc: Jeff Layton, djwong, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Mon, 2011-04-11 at 13:41 -0400, Chris Mason wrote: > Excerpts from Jeff Layton's message of 2011-04-11 12:42:29 -0400: > > > @@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > if (ret < 0) > > > goto out_unlock; > > > ret = 0; > > > + > > > + /* > > > + * write_begin/end might have created a dirty page and someone > > > + * could wander in and start the IO. Make sure that hasn't > > > + * happened. > > > + */ > > > + lock_page(page); > > > + wait_on_page_writeback(page); > > > + unlock_page(page); > > I am little puzzled here. if someone wander in and start the IO, the page is up-to-date (dirtied by this page_mkwrite). We shouldn't see the checksum inconsistancy, right? > > nit: > > > > The callers of page_mkwrite always lock the page afterward if you > > return from page_mkwrite with it unlocked. If you plan to take page > > lock anyway, it's probably slightly more efficient not to unlock it and > > instead return VM_FAULT_LOCKED. > > > > Actually this isn't a nit. Keeping the page locked closes an important > hole where it can become writeback again. It might fix the last > remaining problem. > Oh, right. Currently ext4_page_mkwrite drops the page lock before calling it's dirty the page (by write_begin() and write_end(). I suspect regrab the lock() after write_end() (with your proposed change) and returning with locked still leave the dirty by ext4_page_mkwrite unlocked. We probably should to keep the page locked the page during the entire ext4_page_mkwrite() call. Any reason to drop the page lock() before calling aops->write_begin()? Mingming ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-12 0:46 ` Mingming Cao @ 2011-04-12 0:57 ` Christoph Hellwig 2011-04-14 0:48 ` Mingming Cao 0 siblings, 1 reply; 67+ messages in thread From: Christoph Hellwig @ 2011-04-12 0:57 UTC (permalink / raw) To: Mingming Cao Cc: Chris Mason, Jeff Layton, djwong, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Mon, Apr 11, 2011 at 05:46:52PM -0700, Mingming Cao wrote: > Oh, right. Currently ext4_page_mkwrite drops the page lock before > calling it's dirty the page (by write_begin() and write_end(). I > suspect regrab the lock() after write_end() (with your proposed change) > and returning with locked still leave the dirty by ext4_page_mkwrite > unlocked. We probably should to keep the page locked the page during > the entire ext4_page_mkwrite() call. Any reason to drop the page lock() > before calling aops->write_begin()? write_begin takes the page lock by itself. That's one of the reasons why block_page_mkwrite doesn't use plain ->write_begin / write_end, the other beeing that we already get a page passed to use, so there's no need to do the pagecache lookup or allocation done by grab_cache_page_write_begin. The best thing would be to completely drop ext4's current version of page_mkwrite and start out with a copy of block_page_mkwrite which has the journalling calls added back into it. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] block integrity: Fix write after checksum calculation problem 2011-04-12 0:57 ` Christoph Hellwig @ 2011-04-14 0:48 ` Mingming Cao 2011-04-22 0:02 ` [RFC v2] block integrity: Stabilize(?) pages during writeback Darrick J. Wong 0 siblings, 1 reply; 67+ messages in thread From: Mingming Cao @ 2011-04-14 0:48 UTC (permalink / raw) To: Christoph Hellwig Cc: Chris Mason, Jeff Layton, djwong, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi On Mon, 2011-04-11 at 20:57 -0400, Christoph Hellwig wrote: > On Mon, Apr 11, 2011 at 05:46:52PM -0700, Mingming Cao wrote: > > Oh, right. Currently ext4_page_mkwrite drops the page lock before > > calling it's dirty the page (by write_begin() and write_end(). I > > suspect regrab the lock() after write_end() (with your proposed change) > > and returning with locked still leave the dirty by ext4_page_mkwrite > > unlocked. We probably should to keep the page locked the page during > > the entire ext4_page_mkwrite() call. Any reason to drop the page lock() > > before calling aops->write_begin()? > > write_begin takes the page lock by itself. That's one of the reasons why > block_page_mkwrite doesn't use plain ->write_begin / write_end, the > other beeing that we already get a page passed to use, so there's no > need to do the pagecache lookup or allocation done by > grab_cache_page_write_begin. > > The best thing would be to completely drop ext4's current version > of page_mkwrite and start out with a copy of block_page_mkwrite which > has the journalling calls added back into it. The problem is the locking order, we can't hold page lock then start the journal lock. Kjournald will need to hold the journal lock first, then commit, commit may need to callback writepages, which need to hold the page lock. I looked at ext3, in that case, ext3 even don't have ext3_page_mkwrite() to do the stable page yet. It requires some block reservation/delayed allocation for filling holes in mmaped IO case. Jan tried that before I don't think the proposal get far. Now looking back ext4_page_mkwrite(), it calls write_begin(), as long as we do wait in write_begin() things would be fine, right? It seems Darrick already did that wait per Dave Chinner's suggestion when grab the page and lock that page. But still a puzzle to me why that's not sufficient. Mingming > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 67+ messages in thread
* [RFC v2] block integrity: Stabilize(?) pages during writeback 2011-04-14 0:48 ` Mingming Cao @ 2011-04-22 0:02 ` Darrick J. Wong 2011-04-22 12:50 ` Chris Mason ` (4 more replies) 0 siblings, 5 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-04-22 0:02 UTC (permalink / raw) To: Mingming Cao Cc: Christoph Hellwig, Chris Mason, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen On Wed, Apr 13, 2011 at 05:48:48PM -0700, Mingming Cao wrote: > On Mon, 2011-04-11 at 20:57 -0400, Christoph Hellwig wrote: > > On Mon, Apr 11, 2011 at 05:46:52PM -0700, Mingming Cao wrote: > > > Oh, right. Currently ext4_page_mkwrite drops the page lock before > > > calling it's dirty the page (by write_begin() and write_end(). I > > > suspect regrab the lock() after write_end() (with your proposed change) > > > and returning with locked still leave the dirty by ext4_page_mkwrite > > > unlocked. We probably should to keep the page locked the page during > > > the entire ext4_page_mkwrite() call. Any reason to drop the page lock() > > > before calling aops->write_begin()? > > > > write_begin takes the page lock by itself. That's one of the reasons why > > block_page_mkwrite doesn't use plain ->write_begin / write_end, the > > other beeing that we already get a page passed to use, so there's no > > need to do the pagecache lookup or allocation done by > > grab_cache_page_write_begin. > > > > The best thing would be to completely drop ext4's current version > > of page_mkwrite and start out with a copy of block_page_mkwrite which > > has the journalling calls added back into it. > > The problem is the locking order, we can't hold page lock then start the > journal lock. Kjournald will need to hold the journal lock first, then > commit, commit may need to callback writepages, which need to hold the > page lock. > > > I looked at ext3, in that case, ext3 even don't have ext3_page_mkwrite() > to do the stable page yet. It requires some block reservation/delayed > allocation for filling holes in mmaped IO case. Jan tried that before I > don't think the proposal get far. > > > Now looking back ext4_page_mkwrite(), it calls write_begin(), as long as > we do wait in write_begin() things would be fine, right? It seems > Darrick already did that wait per Dave Chinner's suggestion when grab > the page and lock that page. But still a puzzle to me why that's not > sufficient. Hi everyone, I've finally managed to get together a patch that seems to provide stable pages during writeback, or at least gets us to the point that after several days of running tests I don't see DIF checksum errors anymore. :) The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to walk the process tree to find all userland ptes that map to a particular memory page and revoke write access, and (b) ext4_page_mkwrite should always lock (and wait on) a page before returning. There are still some huge warts with the patch; in particular, the x86-specific set_memory_r[ow] needs to go away, which is what I was trying to do with the #ifdef USE_X86 case. Also, there are probably more wait_for_page_writeback()s than are really necessary. Going forward, however, it might be easier to take all those waits out and simply copy-migrate the page to another location when the page fault handler sees the write-during-io case, so I'll look into that tomorrow. I'm also not particularly sure what happens when huge pages get involved. Debug-quality patch follows, before my kernel-build disks crash again. --D --- diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c index 9c5e6b2..8031bd7 100644 --- a/fs/bio-integrity.c +++ b/fs/bio-integrity.c @@ -25,6 +25,7 @@ #include <linux/bio.h> #include <linux/workqueue.h> #include <linux/slab.h> +#include <asm/tlbflush.h> struct integrity_slab { struct kmem_cache *slab; @@ -382,6 +383,110 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi) return 0; } +#define USE_X86 + +#ifdef USE_X86 +static int set_page_ro(struct page *page) +{ + int set_memory_ro(unsigned long addr, int numpages); + unsigned long addr = (unsigned long)page_address(page); + if (PageHighMem(page)) + return 0; + return set_memory_ro(addr, 1); +} + +static int set_page_rw(struct page *page) +{ + int set_memory_rw(unsigned long addr, int numpages); + unsigned long addr = (unsigned long)page_address(page); + if (PageHighMem(page)) + return 0; + return set_memory_rw(addr, 1); +} +#endif + +#include <linux/rmap.h> +static void set_bio_page_access(struct mm_struct *mm, struct bio *bio, int rw) +{ + struct bio_vec *from; + int i; + + __bio_for_each_segment(from, bio, i, 0) { + struct vm_area_struct *vma; + int modified = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + unsigned long address; + pte_t *pte; + spinlock_t *ptl; + + // grab pte, change to ro? + address = page_address_in_vma(from->bv_page, vma); + if (address == -EFAULT) /* out of vma range */ + continue; + pte = page_check_address(from->bv_page, vma->vm_mm, address, &ptl, 1); + if (!pte) /* the page is not in this mm */ + continue; + modified = 1; + if (rw) { + pte_t old_pte = *pte; + set_pte_at(mm, address, pte, pte_mkwrite(old_pte)); + } else + ptep_set_wrprotect(mm, address, pte); + pte_unmap_unlock(pte, ptl); + } + if (modified) + flush_tlb_mm(mm); + } +} + +static void set_bio_page_rw(struct mm_struct *mm, struct bio *bio) +{ + set_bio_page_access(mm, bio, 1); +} + +static void set_bio_page_ro(struct mm_struct *mm, struct bio *bio) +{ + set_bio_page_access(mm, bio, 0); +} + +#ifdef USE_X86 +static void bio_integrity_write_fn(struct work_struct *work) +{ + struct bio_integrity_payload *bip = + container_of(work, struct bio_integrity_payload, bip_work); + struct bio *bio = bip->bip_bio; + struct bio_vec *from; + int i; + + __bio_for_each_segment(from, bio, i, 0) { + set_page_rw(from->bv_page); + } + + /* Restore original bio completion handler */ + bio->bi_end_io = bip->bip_end_io; + bio_endio(bio, bip->bip_error); +} +#endif + +static void bio_integrity_endio_write(struct bio *bio, int error) +{ + struct bio_integrity_payload *bip = bio->bi_integrity; + + BUG_ON(bip->bip_bio != bio); + +#ifndef USE_X86 + set_bio_page_rw(&init_mm, bio); + bio->bi_end_io = bip->bip_end_io; + bio_endio(bio, bip->bip_error); + return; +#else + bip->bip_error = error; + INIT_WORK(&bip->bip_work, bio_integrity_write_fn); + queue_work(kintegrityd_wq, &bip->bip_work); +#endif +} + /** * bio_integrity_prep - Prepare bio for integrity I/O * @bio: bio to prepare @@ -468,8 +573,42 @@ int bio_integrity_prep(struct bio *bio) } /* Auto-generate integrity metadata if this is a write */ - if (bio_data_dir(bio) == WRITE) + if (bio_data_dir(bio) == WRITE) { + struct task_struct *p; + struct mm_struct *mm; + + bip->bip_end_io = bio->bi_end_io; + bio->bi_end_io = bio_integrity_endio_write; + +#ifndef USE_X86 +// kmap_flush_unused(); +// vm_unmap_aliases(); + set_bio_page_ro(&init_mm, bio); + flush_tlb_all(); +#else + { + struct bio_vec *from; + int i; + + __bio_for_each_segment(from, bio, i, 0) { + set_page_ro(from->bv_page); + } + } +#endif + + for_each_process(p) { + task_lock(p); + mm = p->mm; + if (!mm) { + task_unlock(p); + continue; + } + + set_bio_page_ro(mm, bio); + task_unlock(p); + } bio_integrity_generate(bio); + } return 0; } diff --git a/fs/buffer.c b/fs/buffer.c index a08bb8e..02156c5 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -376,8 +376,10 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate) if (!quiet_error(bh)) { buffer_io_error(bh); printk(KERN_WARNING "lost page write due to " - "I/O error on %s\n", - bdevname(bh->b_bdev, b)); + "I/O error on %s, flags=%lx\n", + bdevname(bh->b_bdev, b), page->flags); +if (page->mapping && page->mapping->host) +printk(KERN_ERR "mode=%x inode=%d dev=%d\n", page->mapping->host->i_mode, page->mapping->host->i_ino, (page->mapping->host->i_rdev)); } set_bit(AS_EIO, &page->mapping->flags); set_buffer_write_io_error(bh); @@ -2357,6 +2359,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, else end = PAGE_CACHE_SIZE; + WARN_ON(!PageLocked(page)); + wait_on_page_writeback(page); ret = __block_write_begin(page, 0, end, get_block); if (!ret) ret = block_commit_write(page, 0, end); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f2fa5e8..228b4aa 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2667,6 +2667,8 @@ static int ext4_writepage(struct page *page, */ goto redirty_page; } + WARN_ON(!PageLocked(page)); + wait_on_page_writeback(page); if (commit_write) /* now mark the buffer_heads as dirty and uptodate */ block_commit_write(page, 0, len); @@ -2779,7 +2781,8 @@ static int write_cache_pages_da(struct address_space *mapping, } lock_page(page); - + if (PageWriteback(page)) + wait_on_page_writeback(page); /* * If the page is no longer dirty, or its * mapping no longer corresponds to inode we @@ -5811,15 +5814,21 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) goto out_unlock; } ret = 0; - if (PageMappedToDisk(page)) - goto out_unlock; + lock_page(page); + if (PageWriteback(page)) + wait_on_page_writeback(page); + if (PageMappedToDisk(page)) { + up_read(&inode->i_alloc_sem); + return VM_FAULT_LOCKED; + } + /* this one seems to handle mmap */ + wait_on_page_writeback(page); if (page->index == size >> PAGE_CACHE_SHIFT) len = size & ~PAGE_CACHE_MASK; else len = PAGE_CACHE_SIZE; - lock_page(page); /* * return if we have all the buffers mapped. This avoid * the need to call write_begin/write_end which does a @@ -5829,8 +5838,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (page_has_buffers(page)) { if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { - unlock_page(page); - goto out_unlock; + up_read(&inode->i_alloc_sem); + return VM_FAULT_LOCKED; } } unlock_page(page); @@ -5850,6 +5859,17 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (ret < 0) goto out_unlock; ret = 0; + + /* + * write_begin/end might have created a dirty page and someone + * could wander in and start the IO. Make sure that hasn't + * happened. + */ + lock_page(page); + if (PageWriteback(page)) + wait_on_page_writeback(page); + up_read(&inode->i_alloc_sem); + return VM_FAULT_LOCKED; out_unlock: if (ret) ret = VM_FAULT_SIGBUS; diff --git a/include/linux/bio.h b/include/linux/bio.h index ce33e68..ea36e89 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -180,6 +180,7 @@ struct bio_integrity_payload { unsigned short bip_vcnt; /* # of integrity bio_vecs */ unsigned short bip_idx; /* current bip_vec index */ + int bip_error; /* bio completion status? */ struct work_struct bip_work; /* I/O completion */ struct bio_vec bip_vec[0]; /* embedded bvec array */ }; diff --git a/mm/filemap.c b/mm/filemap.c index c641edf..3ed13a3 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2277,8 +2277,9 @@ EXPORT_SYMBOL(generic_file_direct_write); * Find or create a page at the given pagecache position. Return the locked * page. This function is specifically for buffered writes. */ -struct page *grab_cache_page_write_begin(struct address_space *mapping, - pgoff_t index, unsigned flags) +static struct page * +__grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, + unsigned flags) { int status; struct page *page; @@ -2303,6 +2304,20 @@ repeat: } return page; } + +struct page *grab_cache_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags) +{ + struct page *p; + + p = __grab_cache_page_write_begin(mapping, index, flags); + if (p) { + WARN_ON(!PageLocked(p)); + wait_on_page_writeback(p); + } + + return p; +} EXPORT_SYMBOL(grab_cache_page_write_begin); static ssize_t generic_perform_write(struct file *file, diff --git a/mm/memory.c b/mm/memory.c index ce22a25..9d3dc5d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3156,6 +3156,9 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, } else VM_BUG_ON(!PageLocked(page)); page_mkwrite = 1; + } else { + WARN_ON(!PageLocked(page)); + wait_on_page_writeback(page); } } ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [RFC v2] block integrity: Stabilize(?) pages during writeback 2011-04-22 0:02 ` [RFC v2] block integrity: Stabilize(?) pages during writeback Darrick J. Wong @ 2011-04-22 12:50 ` Chris Mason 2011-04-22 20:34 ` Jan Kara 2011-05-04 17:37 ` [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 1 reply; 67+ messages in thread From: Chris Mason @ 2011-04-22 12:50 UTC (permalink / raw) To: djwong Cc: Mingming Cao, Christoph Hellwig, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400: > Hi everyone, > > I've finally managed to get together a patch that seems to provide stable pages > during writeback, or at least gets us to the point that after several days of > running tests I don't see DIF checksum errors anymore. :) > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to > walk the process tree to find all userland ptes that map to a particular memory > page and revoke write access, and Hmm, did you need the bio_integrity_prep change for all the filesystems? This should be happening already as part of using page_mkwrite. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC v2] block integrity: Stabilize(?) pages during writeback 2011-04-22 12:50 ` Chris Mason @ 2011-04-22 20:34 ` Jan Kara 2011-04-26 0:37 ` Darrick J. Wong 0 siblings, 1 reply; 67+ messages in thread From: Jan Kara @ 2011-04-22 20:34 UTC (permalink / raw) To: Chris Mason Cc: djwong, Mingming Cao, Christoph Hellwig, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen On Fri 22-04-11 08:50:01, Chris Mason wrote: > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400: > > Hi everyone, > > > > I've finally managed to get together a patch that seems to provide stable pages > > during writeback, or at least gets us to the point that after several days of > > running tests I don't see DIF checksum errors anymore. :) > > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to > > walk the process tree to find all userland ptes that map to a particular memory > > page and revoke write access, and > > Hmm, did you need the bio_integrity_prep change for all the filesystems? > This should be happening already as part of using page_mkwrite. Or more precisely page_mkclean() should do what you try to do in bio_integrity_prep()... It would certainly be interesting (bug) if you could write to the page after calling page_mkclean() without page_mkwrite() being called. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC v2] block integrity: Stabilize(?) pages during writeback 2011-04-22 20:34 ` Jan Kara @ 2011-04-26 0:37 ` Darrick J. Wong 2011-04-26 11:33 ` Chris Mason 2011-04-26 11:37 ` Jan Kara 0 siblings, 2 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-04-26 0:37 UTC (permalink / raw) To: Jan Kara Cc: Chris Mason, Mingming Cao, Christoph Hellwig, Jeff Layton, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote: > On Fri 22-04-11 08:50:01, Chris Mason wrote: > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400: > > > Hi everyone, > > > > > > I've finally managed to get together a patch that seems to provide stable pages > > > during writeback, or at least gets us to the point that after several days of > > > running tests I don't see DIF checksum errors anymore. :) > > > > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to > > > walk the process tree to find all userland ptes that map to a particular memory > > > page and revoke write access, and > > > > Hmm, did you need the bio_integrity_prep change for all the filesystems? > > This should be happening already as part of using page_mkwrite. > Or more precisely page_mkclean() should do what you try to do in > bio_integrity_prep()... It would certainly be interesting (bug) if you > could write to the page after calling page_mkclean() without page_mkwrite() > being called. Hm... in mpage_da_submit_io I see the following sequence of calls: 1. clear_page_dirty_for_io 2. possibly one of: ext4_bio_write_page or block_write_full_page. If ext4_bio_write_page, 2a. kmem_cache_alloc 2b. set_page_writeback Before and after #1, the page is locked but writeback is not set. Before #2, the page must be locked and writeback must not be set, because both of those two functions want to set the writeback bit themselves. However, ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can sleep (I think). Unfortunately, ext4_page_mkwrite will check for page locked, wait for page writeback, and then return the page. I think it is theoretically possible for #1 to trigger a page_mkwrite which completes before #2b, right? In which case the thread that called mkwrite will think that the page isn't being written out, and happily scribble on it during writeback. I could be wrong, but it seems to me that one has to write-protect the page after setting the writeback bit. I guess we could call page_mkclean from bio_integrity_prep, though. --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC v2] block integrity: Stabilize(?) pages during writeback 2011-04-26 0:37 ` Darrick J. Wong @ 2011-04-26 11:33 ` Chris Mason 2011-05-03 1:59 ` Darrick J. Wong 2011-04-26 11:37 ` Jan Kara 1 sibling, 1 reply; 67+ messages in thread From: Chris Mason @ 2011-04-26 11:33 UTC (permalink / raw) To: djwong Cc: Jan Kara, Mingming Cao, Christoph Hellwig, Jeff Layton, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen Excerpts from Darrick J. Wong's message of 2011-04-25 20:37:38 -0400: > On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote: > > On Fri 22-04-11 08:50:01, Chris Mason wrote: > > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400: > > > > Hi everyone, > > > > > > > > I've finally managed to get together a patch that seems to provide stable pages > > > > during writeback, or at least gets us to the point that after several days of > > > > running tests I don't see DIF checksum errors anymore. :) > > > > > > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to > > > > walk the process tree to find all userland ptes that map to a particular memory > > > > page and revoke write access, and > > > > > > Hmm, did you need the bio_integrity_prep change for all the filesystems? > > > This should be happening already as part of using page_mkwrite. > > Or more precisely page_mkclean() should do what you try to do in > > bio_integrity_prep()... It would certainly be interesting (bug) if you > > could write to the page after calling page_mkclean() without page_mkwrite() > > being called. > > Hm... in mpage_da_submit_io I see the following sequence of calls: > > 1. clear_page_dirty_for_io > 2. possibly one of: ext4_bio_write_page or block_write_full_page. > If ext4_bio_write_page, > 2a. kmem_cache_alloc > 2b. set_page_writeback > > Before and after #1, the page is locked but writeback is not set. > > Before #2, the page must be locked and writeback must not be set, because both > of those two functions want to set the writeback bit themselves. However, > ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can > sleep (I think). Sleeping isn't the problem as long as you sleep with the page locked. The idea is that writepage will: 1) lock the page 2) clear_page_dirty_for_io (which calls page_mkclean) 3) set_page_writeback() 4) unlock the page 5) start the IO page_mkwrite will: 1) lock the page 2) wait on page writeback 3) do other stuff So if ext is calling set_page_writeback() on an unlocked page, that's a problem. Otherwise it should be working. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC v2] block integrity: Stabilize(?) pages during writeback 2011-04-26 11:33 ` Chris Mason @ 2011-05-03 1:59 ` Darrick J. Wong 2011-05-04 1:26 ` Darrick J. Wong 0 siblings, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-05-03 1:59 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Mingming Cao, Christoph Hellwig, Jeff Layton, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen On Tue, Apr 26, 2011 at 07:33:18AM -0400, Chris Mason wrote: > Excerpts from Darrick J. Wong's message of 2011-04-25 20:37:38 -0400: > > On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote: > > > On Fri 22-04-11 08:50:01, Chris Mason wrote: > > > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400: > > > > > Hi everyone, > > > > > > > > > > I've finally managed to get together a patch that seems to provide stable pages > > > > > during writeback, or at least gets us to the point that after several days of > > > > > running tests I don't see DIF checksum errors anymore. :) > > > > > > > > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to > > > > > walk the process tree to find all userland ptes that map to a particular memory > > > > > page and revoke write access, and > > > > > > > > Hmm, did you need the bio_integrity_prep change for all the filesystems? > > > > This should be happening already as part of using page_mkwrite. > > > Or more precisely page_mkclean() should do what you try to do in > > > bio_integrity_prep()... It would certainly be interesting (bug) if you > > > could write to the page after calling page_mkclean() without page_mkwrite() > > > being called. > > > > Hm... in mpage_da_submit_io I see the following sequence of calls: > > > > 1. clear_page_dirty_for_io > > 2. possibly one of: ext4_bio_write_page or block_write_full_page. > > If ext4_bio_write_page, > > 2a. kmem_cache_alloc > > 2b. set_page_writeback > > > > Before and after #1, the page is locked but writeback is not set. > > > > Before #2, the page must be locked and writeback must not be set, because both > > of those two functions want to set the writeback bit themselves. However, > > ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can > > sleep (I think). > > Sleeping isn't the problem as long as you sleep with the page locked. > The idea is that writepage will: > > 1) lock the page > 2) clear_page_dirty_for_io (which calls page_mkclean) > 3) set_page_writeback() > 4) unlock the page > 5) start the IO > > page_mkwrite will: > > 1) lock the page > 2) wait on page writeback > 3) do other stuff > > So if ext is calling set_page_writeback() on an unlocked page, that's a > problem. Otherwise it should be working. You're right, at this point in time writepage and page_mkwrite in ext4 both behave as you describe. I began backing out parts of my patches to bio-integrity.c and discovered that with the current kernel (2.6.39-rc5) the only part that seems useful is the set_memory_ro/rw pair from that old debugging patch. Unfortunately, those two functions only seem to exist on x86; I suppose I could port them to others. If that's even a sane idea. --D > > -chris > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC v2] block integrity: Stabilize(?) pages during writeback 2011-05-03 1:59 ` Darrick J. Wong @ 2011-05-04 1:26 ` Darrick J. Wong 0 siblings, 0 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-05-04 1:26 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Mingming Cao, Christoph Hellwig, Jeff Layton, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen On Mon, May 02, 2011 at 06:59:31PM -0700, Darrick J. Wong wrote: > On Tue, Apr 26, 2011 at 07:33:18AM -0400, Chris Mason wrote: > > Excerpts from Darrick J. Wong's message of 2011-04-25 20:37:38 -0400: > > > On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote: > > > > On Fri 22-04-11 08:50:01, Chris Mason wrote: > > > > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400: > > > > > > Hi everyone, > > > > > > > > > > > > I've finally managed to get together a patch that seems to provide stable pages > > > > > > during writeback, or at least gets us to the point that after several days of > > > > > > running tests I don't see DIF checksum errors anymore. :) > > > > > > > > > > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to > > > > > > walk the process tree to find all userland ptes that map to a particular memory > > > > > > page and revoke write access, and > > > > > > > > > > Hmm, did you need the bio_integrity_prep change for all the filesystems? > > > > > This should be happening already as part of using page_mkwrite. > > > > Or more precisely page_mkclean() should do what you try to do in > > > > bio_integrity_prep()... It would certainly be interesting (bug) if you > > > > could write to the page after calling page_mkclean() without page_mkwrite() > > > > being called. > > > > > > Hm... in mpage_da_submit_io I see the following sequence of calls: > > > > > > 1. clear_page_dirty_for_io > > > 2. possibly one of: ext4_bio_write_page or block_write_full_page. > > > If ext4_bio_write_page, > > > 2a. kmem_cache_alloc > > > 2b. set_page_writeback > > > > > > Before and after #1, the page is locked but writeback is not set. > > > > > > Before #2, the page must be locked and writeback must not be set, because both > > > of those two functions want to set the writeback bit themselves. However, > > > ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can > > > sleep (I think). > > > > Sleeping isn't the problem as long as you sleep with the page locked. > > The idea is that writepage will: > > > > 1) lock the page > > 2) clear_page_dirty_for_io (which calls page_mkclean) > > 3) set_page_writeback() > > 4) unlock the page > > 5) start the IO > > > > page_mkwrite will: > > > > 1) lock the page > > 2) wait on page writeback > > 3) do other stuff > > > > So if ext is calling set_page_writeback() on an unlocked page, that's a > > problem. Otherwise it should be working. > > You're right, at this point in time writepage and page_mkwrite in ext4 both > behave as you describe. I began backing out parts of my patches to > bio-integrity.c and discovered that with the current kernel (2.6.39-rc5) the > only part that seems useful is the set_memory_ro/rw pair from that old > debugging patch. Unfortunately, those two functions only seem to exist on x86; > I suppose I could port them to others. If that's even a sane idea. Never mind, I looked around for anything that would result in the kernel trying to map a page for the purpose of writing, and I think adding a wait_on_writeback to get_cache_page_for_write will solve that last hole without having to port set_memory_* to other arches. With that, a whole lot of code falls out of the patches, which I will post shortly. --D > > --D > > > > -chris > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC v2] block integrity: Stabilize(?) pages during writeback 2011-04-26 0:37 ` Darrick J. Wong 2011-04-26 11:33 ` Chris Mason @ 2011-04-26 11:37 ` Jan Kara 1 sibling, 0 replies; 67+ messages in thread From: Jan Kara @ 2011-04-26 11:37 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, Chris Mason, Mingming Cao, Christoph Hellwig, Jeff Layton, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen On Mon 25-04-11 17:37:38, Darrick J. Wong wrote: > On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote: > > On Fri 22-04-11 08:50:01, Chris Mason wrote: > > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400: > > > > Hi everyone, > > > > > > > > I've finally managed to get together a patch that seems to provide stable pages > > > > during writeback, or at least gets us to the point that after several days of > > > > running tests I don't see DIF checksum errors anymore. :) > > > > > > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to > > > > walk the process tree to find all userland ptes that map to a particular memory > > > > page and revoke write access, and > > > > > > Hmm, did you need the bio_integrity_prep change for all the filesystems? > > > This should be happening already as part of using page_mkwrite. > > Or more precisely page_mkclean() should do what you try to do in > > bio_integrity_prep()... It would certainly be interesting (bug) if you > > could write to the page after calling page_mkclean() without page_mkwrite() > > being called. > > Hm... in mpage_da_submit_io I see the following sequence of calls: > > 1. clear_page_dirty_for_io > 2. possibly one of: ext4_bio_write_page or block_write_full_page. > If ext4_bio_write_page, > 2a. kmem_cache_alloc > 2b. set_page_writeback > > Before and after #1, the page is locked but writeback is not set. > > Before #2, the page must be locked and writeback must not be set, because both > of those two functions want to set the writeback bit themselves. However, > ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can > sleep (I think). Yes, it can sleep. But the page remains locked until we set page as writeback in both cases. > Unfortunately, ext4_page_mkwrite will check for page locked, wait for > page writeback, and then return the page. I think it is theoretically > possible for #1 to trigger a page_mkwrite which completes before #2b, > right? I'm not sure I understand but once the page is locked by ext4_writepages() before #1, ext4_page_mkwrite() will block until it can get the page lock - which can happen only after set_page_writeback() in #2 is done. So then ext4_page_mkwrite() will block waiting for PageWriteback which gets cleared after the IO is finished... Or did you mean something else? > In which case the thread that called mkwrite will think that the > page isn't being written out, and happily scribble on it during > writeback. I could be wrong, but it seems to me that one has to > write-protect the page after setting the writeback bit. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 2011-04-22 0:02 ` [RFC v2] block integrity: Stabilize(?) pages during writeback Darrick J. Wong 2011-04-22 12:50 ` Chris Mason @ 2011-05-04 17:37 ` Darrick J. Wong 2011-05-04 18:46 ` Christoph Hellwig 2011-05-04 17:39 ` [PATCH v3 1/3] ext4: Clean up some wait_on_page_writeback calls Darrick J. Wong ` (2 subsequent siblings) 4 siblings, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-05-04 17:37 UTC (permalink / raw) To: Theodore Ts'o Cc: Christoph Hellwig, Chris Mason, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm Hi all, This is v3 of the stable-page-writes patchset for ext4. A lot of code has been cut out since v2 of this patch set. For v3, the large hairy function to walk the page tables of every process is gone since Chris Mason pointed out that page_mkclean does what I need. The set_memory_* hack is also gone, since (I think) the only time the kernel maps a file data blocks for writing is in the buffered IO case. That leaves us with some surgery to ext4_page_mkwrite to return locked pages and to be careful about re-checking the writeback status after dropping and re-grabbing the page lock; and a slight modification to the mm code to wait for page writeback when grabbing pages for buffered writes. There are also some cleanups for wait_on_page_writeback use in ext4. I ran my write-after-checksum ("wac") reproducer program to try to create the DIF checksum errors by madly rewriting the same memory pages. In fact, I tried the following combinations: a. 64 write() threads + sync_file_range b. 64 mmap write threads + msync c. 32 write() threads + sync_file_range + 32 mmap write threads + msync d. Same as C, but with all threads in directio mode e. Same as A, but with all threads in directio mode f. Same as B, but with all threads in directio mode After some 44 hours of safety testing across 4 machines, I saw zero errors. Before the patchset, I could run any of A-F for 10 seconds or less and have a screen full of errors. To assess the performance impact of stable page writes, I moved to a disk that doesn't have DIF support so that I could measure just the impact of waiting for writeback. I first ran wac with 64 threads madly scribbling on a 64k file and saw about a 12% performance decrease. I then reran the wac program with 64 threads and a 64MB file and saw about the same performance numbers. I will of course be testing a wider range of hardware now that I have a functioning patch set, though as I suspected the patchset only seems to impact workloads that rewrite the same memory page frequently. As always, questions and comments are welcome; and thank you to all the previous reviewers of this patchset! --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 2011-05-04 17:37 ` [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 Darrick J. Wong @ 2011-05-04 18:46 ` Christoph Hellwig 2011-05-04 19:21 ` Chris Mason 0 siblings, 1 reply; 67+ messages in thread From: Christoph Hellwig @ 2011-05-04 18:46 UTC (permalink / raw) To: Darrick J. Wong Cc: Theodore Ts'o, Christoph Hellwig, Chris Mason, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm This seems to miss out on a lot of the generic functionality like write_cache_pages and block_page_mkwrite and just patch it into the ext4 copy & paste variants. Please make sure your patches also work for filesystem that use more of the generic functionality like xfs or ext2 (the latter one might be fun for the mmap case). Also what's the status of btrfs? I remembered there was one or two bits missing despite doing the right thing in most areas. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 2011-05-04 18:46 ` Christoph Hellwig @ 2011-05-04 19:21 ` Chris Mason 2011-05-04 20:00 ` Darrick J. Wong 2011-05-04 23:57 ` Darrick J. Wong 0 siblings, 2 replies; 67+ messages in thread From: Chris Mason @ 2011-05-04 19:21 UTC (permalink / raw) To: Christoph Hellwig Cc: Darrick J. Wong, Theodore Ts'o, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm Excerpts from Christoph Hellwig's message of 2011-05-04 14:46:44 -0400: > This seems to miss out on a lot of the generic functionality like > write_cache_pages and block_page_mkwrite and just patch it into > the ext4 copy & paste variants. Please make sure your patches also > work for filesystem that use more of the generic functionality like > xfs or ext2 (the latter one might be fun for the mmap case). Probably after the block_commit_write in block_page_mkwrite() Another question is, do we want to introduce a wait_on_stable_page_writeback()? This would allow us to add a check against the bdi requesting stable pages. > > Also what's the status of btrfs? I remembered there was one or two > bits missing despite doing the right thing in most areas. As far as I know btrfs is getting it right. The only bit missing is the one Nick Piggin pointed out where it is possible to change mmap'd O_DIRECT memory in flight while a DIO is in progress. Josef has a test case that demonstrates this. Nick had a plan to fix it, but it involved redoing the get_user_pages api. -chris ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 2011-05-04 19:21 ` Chris Mason @ 2011-05-04 20:00 ` Darrick J. Wong 2011-05-04 23:57 ` Darrick J. Wong 1 sibling, 0 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-05-04 20:00 UTC (permalink / raw) To: Chris Mason Cc: Christoph Hellwig, Theodore Ts'o, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm On Wed, May 04, 2011 at 03:21:55PM -0400, Chris Mason wrote: > Excerpts from Christoph Hellwig's message of 2011-05-04 14:46:44 -0400: > > This seems to miss out on a lot of the generic functionality like > > write_cache_pages and block_page_mkwrite and just patch it into > > the ext4 copy & paste variants. Please make sure your patches also > > work for filesystem that use more of the generic functionality like > > xfs or ext2 (the latter one might be fun for the mmap case). > > Probably after the block_commit_write in block_page_mkwrite() Yes, I'm working on providing more generic fixes for ext3 & friends too, but they're not really working yet, so I was posting the parts that fix ext4, since they seem usable. > Another question is, do we want to introduce a wait_on_stable_page_writeback()? > > This would allow us to add a check against the bdi requesting stable > pages. Sounds like a good idea. > > Also what's the status of btrfs? I remembered there was one or two > > bits missing despite doing the right thing in most areas. > > As far as I know btrfs is getting it right. The only bit missing is the > one Nick Piggin pointed out where it is possible to change mmap'd O_DIRECT > memory in flight while a DIO is in progress. Josef has a test case that > demonstrates this. > > Nick had a plan to fix it, but it involved redoing the get_user_pages > api. I ran the same six tests A-F on btrfs and it reported -ENOSPC with 1% of the space used, though until it did that I didn't see any checksum errors. --D ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 2011-05-04 19:21 ` Chris Mason 2011-05-04 20:00 ` Darrick J. Wong @ 2011-05-04 23:57 ` Darrick J. Wong 2011-05-05 15:26 ` Jan Kara 1 sibling, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-05-04 23:57 UTC (permalink / raw) To: Chris Mason Cc: Christoph Hellwig, Theodore Ts'o, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm On Wed, May 04, 2011 at 03:21:55PM -0400, Chris Mason wrote: > Excerpts from Christoph Hellwig's message of 2011-05-04 14:46:44 -0400: > > This seems to miss out on a lot of the generic functionality like > > write_cache_pages and block_page_mkwrite and just patch it into > > the ext4 copy & paste variants. Please make sure your patches also > > work for filesystem that use more of the generic functionality like > > xfs or ext2 (the latter one might be fun for the mmap case). > > Probably after the block_commit_write in block_page_mkwrite() > Another question is, do we want to introduce a wait_on_stable_page_writeback()? Something like this here? It fixes block_page_mkwrite users and sticks in a simple page_mkwrite for fses that don't provide one at all. From a quick wac run it seems to make xfs work. ext2 seems to have some issues with modifying a buffer_head's bh_data without locking the bh during the update, so I guess it needs some review. --D -- fs: Modify/provide generic writepage/page_mkwrite functions to wait for writeback Modify the generic writepage function, and add an empty page_mkwrite function, to wait for page writeback to finish before allowing writes. This is so that simple filesystems have stable pages during write operations. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- fs/buffer.c | 1 + mm/filemap.c | 10 ++++++++++ 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index a08bb8e..cf9a795 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2361,6 +2361,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, if (!ret) ret = block_commit_write(page, 0, end); + wait_on_page_writeback(page); if (unlikely(ret)) { unlock_page(page); if (ret == -ENOMEM) diff --git a/mm/filemap.c b/mm/filemap.c index c22675f..9cb4e51 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1713,8 +1713,18 @@ page_not_uptodate: } EXPORT_SYMBOL(filemap_fault); +static int empty_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct page *page = vmf->page; + + lock_page(page); + wait_on_page_writeback(page); + return VM_FAULT_LOCKED; +} + const struct vm_operations_struct generic_file_vm_ops = { .fault = filemap_fault, + .page_mkwrite = empty_page_mkwrite, }; /* This is used for a general mmap of a disk file */ -- Here's the beginning of a patch against ext2 also. It fixes most of the checksum errors when metadata writes are in progress, though I suspect there are more spots that I haven't caught yet. -- ext2: Lock buffer_heads during metadata update ext2 does not protect buffer head that represent metadata against modifications during write operations. This is a problem if the memory buffer needs to be stable during writes; I think there are a few more spots in ext2 that need this treatment. :) Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- fs/ext2/inode.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 788e09a..5314b0b 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1426,6 +1426,7 @@ static int __ext2_write_inode(struct inode *inode, int do_sync) if (IS_ERR(raw_inode)) return -EIO; + lock_buffer(bh); /* For fields not not tracking in the in-memory inode, * initialise them to zero for new inodes. */ if (ei->i_state & EXT2_STATE_NEW) @@ -1502,6 +1503,8 @@ static int __ext2_write_inode(struct inode *inode, int do_sync) } } else for (n = 0; n < EXT2_N_BLOCKS; n++) raw_inode->i_block[n] = ei->i_data[n]; + unlock_buffer(bh); + mark_buffer_dirty(bh); if (do_sync) { sync_dirty_buffer(bh); ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 2011-05-04 23:57 ` Darrick J. Wong @ 2011-05-05 15:26 ` Jan Kara 0 siblings, 0 replies; 67+ messages in thread From: Jan Kara @ 2011-05-05 15:26 UTC (permalink / raw) To: Darrick J. Wong Cc: Chris Mason, Christoph Hellwig, Theodore Ts'o, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm Hello, On Wed 04-05-11 16:57:06, Darrick J. Wong wrote: > On Wed, May 04, 2011 at 03:21:55PM -0400, Chris Mason wrote: > > Excerpts from Christoph Hellwig's message of 2011-05-04 14:46:44 -0400: > > > This seems to miss out on a lot of the generic functionality like > > > write_cache_pages and block_page_mkwrite and just patch it into > > > the ext4 copy & paste variants. Please make sure your patches also > > > work for filesystem that use more of the generic functionality like > > > xfs or ext2 (the latter one might be fun for the mmap case). > > > > Probably after the block_commit_write in block_page_mkwrite() > > Another question is, do we want to introduce a wait_on_stable_page_writeback()? > > Something like this here? It fixes block_page_mkwrite users and sticks in a > simple page_mkwrite for fses that don't provide one at all. From a quick wac > run it seems to make xfs work. ext2 seems to have some issues with modifying a > buffer_head's bh_data without locking the bh during the update, so I guess it > needs some review. Yes, ext2 is rather difficult because of all the metadata updates to buffers happening. That would need a serious work I suspect. > fs: Modify/provide generic writepage/page_mkwrite functions to wait for writeback > > Modify the generic writepage function, and add an empty page_mkwrite function, > to wait for page writeback to finish before allowing writes. This is so that > simple filesystems have stable pages during write operations. > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > --- > > fs/buffer.c | 1 + > mm/filemap.c | 10 ++++++++++ > 2 files changed, 11 insertions(+), 0 deletions(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index a08bb8e..cf9a795 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2361,6 +2361,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > if (!ret) > ret = block_commit_write(page, 0, end); > > + wait_on_page_writeback(page); > if (unlikely(ret)) { > unlock_page(page); > if (ret == -ENOMEM) > diff --git a/mm/filemap.c b/mm/filemap.c > index c22675f..9cb4e51 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1713,8 +1713,18 @@ page_not_uptodate: > } > EXPORT_SYMBOL(filemap_fault); > > +static int empty_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + struct page *page = vmf->page; > + > + lock_page(page); > + wait_on_page_writeback(page); > + return VM_FAULT_LOCKED; > +} > + I guess you miss the whether the page has been truncated here (in which case you should return VM_FAULT_NOPAGE). > const struct vm_operations_struct generic_file_vm_ops = { > .fault = filemap_fault, > + .page_mkwrite = empty_page_mkwrite, > }; > > /* This is used for a general mmap of a disk file */ Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 1/3] ext4: Clean up some wait_on_page_writeback calls 2011-04-22 0:02 ` [RFC v2] block integrity: Stabilize(?) pages during writeback Darrick J. Wong 2011-04-22 12:50 ` Chris Mason 2011-05-04 17:37 ` [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 Darrick J. Wong @ 2011-05-04 17:39 ` Darrick J. Wong 2011-05-04 17:41 ` [PATCH v3 2/3] ext4: Wait for writeback to complete while making pages writable Darrick J. Wong 2011-05-04 17:42 ` [PATCH v3 3/3] mm: Wait for writeback when grabbing pages to begin a write Darrick J. Wong 4 siblings, 0 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-05-04 17:39 UTC (permalink / raw) To: Theodore Ts'o Cc: Christoph Hellwig, Chris Mason, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm wait_on_page_writeback already checks the writeback bit, so callers of it needn't do that test. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- fs/ext4/inode.c | 4 +--- fs/ext4/move_extent.c | 3 +-- 2 files changed, 2 insertions(+), 5 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f2fa5e8..3db34b2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2796,9 +2796,7 @@ static int write_cache_pages_da(struct address_space *mapping, continue; } - if (PageWriteback(page)) - wait_on_page_writeback(page); - + wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); if (mpd->next_page != page->index) diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c index d5c5783..d1548b1 100644 --- a/fs/ext4/move_extent.c +++ b/fs/ext4/move_extent.c @@ -876,8 +876,7 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode, * It needs to call wait_on_page_writeback() to wait for the * writeback of the page. */ - if (PageWriteback(page)) - wait_on_page_writeback(page); + wait_on_page_writeback(page); /* Release old bh and drop refs */ try_to_release_page(page, 0); ^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v3 2/3] ext4: Wait for writeback to complete while making pages writable 2011-04-22 0:02 ` [RFC v2] block integrity: Stabilize(?) pages during writeback Darrick J. Wong ` (2 preceding siblings ...) 2011-05-04 17:39 ` [PATCH v3 1/3] ext4: Clean up some wait_on_page_writeback calls Darrick J. Wong @ 2011-05-04 17:41 ` Darrick J. Wong 2011-05-04 17:42 ` [PATCH v3 3/3] mm: Wait for writeback when grabbing pages to begin a write Darrick J. Wong 4 siblings, 0 replies; 67+ messages in thread From: Darrick J. Wong @ 2011-05-04 17:41 UTC (permalink / raw) To: Theodore Ts'o Cc: Christoph Hellwig, Chris Mason, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm, linux-ext4 In order to stabilize pages during disk writes, ext4_page_mkwrite must wait for writeback operations to complete before making a page writable. Furthermore, the function must return locked pages, and recheck the writeback status if the page lock is ever dropped. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- fs/ext4/inode.c | 24 +++++++++++++++++++----- 1 files changed, 19 insertions(+), 5 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 3db34b2..1d162a2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5809,15 +5809,19 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) goto out_unlock; } ret = 0; - if (PageMappedToDisk(page)) - goto out_unlock; + + lock_page(page); + wait_on_page_writeback(page); + if (PageMappedToDisk(page)) { + up_read(&inode->i_alloc_sem); + return VM_FAULT_LOCKED; + } if (page->index == size >> PAGE_CACHE_SHIFT) len = size & ~PAGE_CACHE_MASK; else len = PAGE_CACHE_SIZE; - lock_page(page); /* * return if we have all the buffers mapped. This avoid * the need to call write_begin/write_end which does a @@ -5827,8 +5831,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (page_has_buffers(page)) { if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { - unlock_page(page); - goto out_unlock; + up_read(&inode->i_alloc_sem); + return VM_FAULT_LOCKED; } } unlock_page(page); @@ -5848,6 +5852,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (ret < 0) goto out_unlock; ret = 0; + + /* + * write_begin/end might have created a dirty page and someone + * could wander in and start the IO. Make sure that hasn't + * happened. + */ + lock_page(page); + wait_on_page_writeback(page); + up_read(&inode->i_alloc_sem); + return VM_FAULT_LOCKED; out_unlock: if (ret) ret = VM_FAULT_SIGBUS; ^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v3 3/3] mm: Wait for writeback when grabbing pages to begin a write 2011-04-22 0:02 ` [RFC v2] block integrity: Stabilize(?) pages during writeback Darrick J. Wong ` (3 preceding siblings ...) 2011-05-04 17:41 ` [PATCH v3 2/3] ext4: Wait for writeback to complete while making pages writable Darrick J. Wong @ 2011-05-04 17:42 ` Darrick J. Wong 2011-05-04 18:48 ` Christoph Hellwig 4 siblings, 1 reply; 67+ messages in thread From: Darrick J. Wong @ 2011-05-04 17:42 UTC (permalink / raw) To: Theodore Ts'o Cc: Christoph Hellwig, Chris Mason, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm, linux-ext4 When grabbing a page for a buffered IO write, the mm should wait for writeback on the page to complete so that the page does not become writable during the IO operation. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- mm/filemap.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index c641edf..c22675f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2287,8 +2287,10 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, gfp_notmask = __GFP_FS; repeat: page = find_lock_page(mapping, index); - if (page) + if (page) { + wait_on_page_writeback(page); return page; + } page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask); if (!page) @@ -2301,6 +2303,7 @@ repeat: goto repeat; return NULL; } + wait_on_page_writeback(page); return page; } EXPORT_SYMBOL(grab_cache_page_write_begin); ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 3/3] mm: Wait for writeback when grabbing pages to begin a write 2011-05-04 17:42 ` [PATCH v3 3/3] mm: Wait for writeback when grabbing pages to begin a write Darrick J. Wong @ 2011-05-04 18:48 ` Christoph Hellwig 0 siblings, 0 replies; 67+ messages in thread From: Christoph Hellwig @ 2011-05-04 18:48 UTC (permalink / raw) To: Darrick J. Wong Cc: Theodore Ts'o, Christoph Hellwig, Chris Mason, Jeff Layton, Jan Kara, Dave Chinner, Joel Becker, Martin K. Petersen, Jens Axboe, linux-kernel, linux-fsdevel, Mingming Cao, linux-scsi, Dave Hansen, linux-mm, linux-ext4 On Wed, May 04, 2011 at 10:42:27AM -0700, Darrick J. Wong wrote: > When grabbing a page for a buffered IO write, the mm should wait for writeback > on the page to complete so that the page does not become writable during the IO > operation. > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> > --- > > mm/filemap.c | 5 ++++- > 1 files changed, 4 insertions(+), 1 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index c641edf..c22675f 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2287,8 +2287,10 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, > gfp_notmask = __GFP_FS; > repeat: > page = find_lock_page(mapping, index); > - if (page) > + if (page) { > + wait_on_page_writeback(page); > return page; > + } goto found; > > page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask); > if (!page) > @@ -2301,6 +2303,7 @@ repeat: > goto repeat; > return NULL; > } found: > + wait_on_page_writeback(page); > return page; > } > EXPORT_SYMBOL(grab_cache_page_write_begin); ^ permalink raw reply [flat|nested] 67+ messages in thread
end of thread, other threads:[~2011-05-05 15:26 UTC | newest] Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-02-22 2:00 [RFC] block integrity: Fix write after checksum calculation problem Darrick J. Wong 2011-02-22 5:45 ` Boaz Harrosh 2011-02-22 11:42 ` Jan Kara 2011-02-22 13:02 ` Chris Mason 2011-02-22 19:13 ` Boaz Harrosh 2011-03-04 20:51 ` Darrick J. Wong 2011-03-04 20:53 ` Christoph Hellwig 2011-02-22 16:13 ` Andreas Dilger 2011-02-22 16:40 ` Martin K. Petersen 2011-02-22 19:45 ` Darrick J. Wong 2011-02-22 22:53 ` Dave Chinner 2011-02-23 16:24 ` Martin K. Petersen 2011-02-23 23:47 ` Dave Chinner 2011-02-24 16:43 ` Jan Kara 2011-02-28 8:49 ` Christoph Hellwig 2011-02-22 16:45 ` Martin K. Petersen 2011-02-23 20:24 ` Joel Becker 2011-02-23 20:35 ` Chris Mason 2011-02-23 21:42 ` Joel Becker 2011-02-24 16:47 ` Jan Kara 2011-02-24 17:37 ` Chris Mason 2011-02-24 18:27 ` Darrick J. Wong 2011-02-28 12:54 ` Chris Mason 2011-03-04 21:07 ` Darrick J. Wong 2011-03-04 22:22 ` Andreas Dilger 2011-03-07 19:11 ` Darrick J. Wong 2011-03-07 21:12 ` Chris Mason 2011-03-08 4:56 ` Dave Chinner 2011-03-10 23:57 ` Darrick J. Wong 2011-03-11 16:34 ` Chris Mason 2011-03-11 18:51 ` Darrick J. Wong 2011-03-19 0:07 ` Darrick J. Wong 2011-03-19 2:28 ` Andreas Dilger 2011-03-22 19:23 ` Darrick J. Wong 2011-03-22 21:54 ` Jan Kara 2011-03-21 14:04 ` Jan Kara 2011-03-21 14:24 ` Chris Mason 2011-03-21 16:43 ` Jan Kara 2011-04-06 23:29 ` Darrick J. Wong 2011-04-07 16:44 ` Darrick J. Wong 2011-04-07 16:57 ` Jan Kara 2011-04-08 20:31 ` Darrick J. Wong 2011-04-11 16:42 ` Jeff Layton 2011-04-11 17:41 ` Chris Mason 2011-04-11 18:25 ` Christoph Hellwig 2011-04-11 18:38 ` Chris Mason 2011-04-12 0:46 ` Mingming Cao 2011-04-12 0:57 ` Christoph Hellwig 2011-04-14 0:48 ` Mingming Cao 2011-04-22 0:02 ` [RFC v2] block integrity: Stabilize(?) pages during writeback Darrick J. Wong 2011-04-22 12:50 ` Chris Mason 2011-04-22 20:34 ` Jan Kara 2011-04-26 0:37 ` Darrick J. Wong 2011-04-26 11:33 ` Chris Mason 2011-05-03 1:59 ` Darrick J. Wong 2011-05-04 1:26 ` Darrick J. Wong 2011-04-26 11:37 ` Jan Kara 2011-05-04 17:37 ` [PATCH v3 0/3] data integrity: Stabilize pages during writeback for ext4 Darrick J. Wong 2011-05-04 18:46 ` Christoph Hellwig 2011-05-04 19:21 ` Chris Mason 2011-05-04 20:00 ` Darrick J. Wong 2011-05-04 23:57 ` Darrick J. Wong 2011-05-05 15:26 ` Jan Kara 2011-05-04 17:39 ` [PATCH v3 1/3] ext4: Clean up some wait_on_page_writeback calls Darrick J. Wong 2011-05-04 17:41 ` [PATCH v3 2/3] ext4: Wait for writeback to complete while making pages writable Darrick J. Wong 2011-05-04 17:42 ` [PATCH v3 3/3] mm: Wait for writeback when grabbing pages to begin a write Darrick J. Wong 2011-05-04 18:48 ` Christoph Hellwig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).