* 2.5 O)DIRECT problem @ 2002-10-04 20:17 Steve Lord 2002-10-04 20:29 ` Andrew Morton 0 siblings, 1 reply; 4+ messages in thread From: Steve Lord @ 2002-10-04 20:17 UTC (permalink / raw) To: Andrew Morton; +Cc: Linux Kernel Hi Andrew, I ran into a problem with 2.5's O_DIRECT read path, generic_file_direct_IO usually ends up in generic_direct_IO this does bounds checking on the I/O and then flushes any cached data. Once we return to generic_file_direct_IO we unconditionally call invalidate_inode_pages2 if there are any cached pages. If we make a non-aligned O_DIRECT read call, the end result is we call invalidate_inode_pages2, but we do not do the filemap_fdatawrite, filemap_fdatawait calls. End result is we throw the buffered data away. Either the flush needs to happen before the bounds checks, or the invalidate should only be done on a successful write. It looks pretty hard to detect the latter case with the current structure, we can get EINVAL from the bounds check and possibly from an aligned, but invalid memory address being passed in. This is worse for xfs than for other filesystems since if we do the invalidate without the flush first, we end up with a delayed allocate extent in memory and no data to flush into it. Various spots in the filesystem dislike this. I can fix it inside xfs by doing my own flush and invalidate first, but the generic code should really be fixed. Thoughts? Steve -- Steve Lord voice: +1-651-683-3511 Principal Engineer, Filesystem Software email: lord@sgi.com ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: 2.5 O)DIRECT problem 2002-10-04 20:17 2.5 O)DIRECT problem Steve Lord @ 2002-10-04 20:29 ` Andrew Morton 2002-10-04 20:38 ` Steve Lord 0 siblings, 1 reply; 4+ messages in thread From: Andrew Morton @ 2002-10-04 20:29 UTC (permalink / raw) To: Steve Lord; +Cc: Linux Kernel Steve Lord wrote: > > Hi Andrew, > > I ran into a problem with 2.5's O_DIRECT read path, > > generic_file_direct_IO usually ends up in generic_direct_IO > this does bounds checking on the I/O and then flushes any > cached data. > > Once we return to generic_file_direct_IO we unconditionally > call invalidate_inode_pages2 if there are any cached pages. > > If we make a non-aligned O_DIRECT read call, the end result is we > call invalidate_inode_pages2, but we do not do the filemap_fdatawrite, > filemap_fdatawait calls. End result is we throw the buffered data away. Well you could always switch to Linus' current BK tree, in which invalidate_inode_pages2() is a no-op (whoops). > Either the flush needs to happen before the bounds checks, or the > invalidate should only be done on a successful write. It looks > pretty hard to detect the latter case with the current structure, > we can get EINVAL from the bounds check and possibly from an > aligned, but invalid memory address being passed in. Yes I agree; let's just do the sync before any checks. I think it should be moved into generic_file_direct_IO(), because that's the place where the invalidation happens, yes? ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: 2.5 O)DIRECT problem 2002-10-04 20:29 ` Andrew Morton @ 2002-10-04 20:38 ` Steve Lord 2002-10-04 20:51 ` Andrew Morton 0 siblings, 1 reply; 4+ messages in thread From: Steve Lord @ 2002-10-04 20:38 UTC (permalink / raw) To: Andrew Morton; +Cc: Linux Kernel On Fri, 2002-10-04 at 15:29, Andrew Morton wrote: > Steve Lord wrote: > > Either the flush needs to happen before the bounds checks, or the > > invalidate should only be done on a successful write. It looks > > pretty hard to detect the latter case with the current structure, > > we can get EINVAL from the bounds check and possibly from an > > aligned, but invalid memory address being passed in. > > Yes I agree; let's just do the sync before any checks. > > I think it should be moved into generic_file_direct_IO(), > because that's the place where the invalidation happens, yes? OK, sounds good to me, I will let my tests churn away on that version and see what happens. I think something else is doing the same thing to me elsewhere, but it might well be an xfs specific case. Steve -- Steve Lord voice: +1-651-683-3511 Principal Engineer, Filesystem Software email: lord@sgi.com ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: 2.5 O)DIRECT problem 2002-10-04 20:38 ` Steve Lord @ 2002-10-04 20:51 ` Andrew Morton 0 siblings, 0 replies; 4+ messages in thread From: Andrew Morton @ 2002-10-04 20:51 UTC (permalink / raw) To: Steve Lord; +Cc: Linux Kernel Steve Lord wrote: > > On Fri, 2002-10-04 at 15:29, Andrew Morton wrote: > > Steve Lord wrote: > > > Either the flush needs to happen before the bounds checks, or the > > > invalidate should only be done on a successful write. It looks > > > pretty hard to detect the latter case with the current structure, > > > we can get EINVAL from the bounds check and possibly from an > > > aligned, but invalid memory address being passed in. > > > > Yes I agree; let's just do the sync before any checks. > > > > I think it should be moved into generic_file_direct_IO(), > > because that's the place where the invalidation happens, yes? > > OK, sounds good to me, I will let my tests churn away on that > version and see what happens. I think something else is doing > the same thing to me elsewhere, but it might well be an xfs > specific case. I queued the below patch. BTW, I'm sitting on a patch from Badari which allows the direct-io code to perform 512-byte-aligned and multiple-of-512-byte-sized IO against a 4k blocksize filesystem. We make (reasonable) assumptions about the return value from get_block(): scale it from softblocksize up to sectors and then add an offset. And we do weird things with the ZERO_PAGE to cater for the case where the filesystem block is buffer_new() - use bits of the ZERO_PAGE to fill out the gaps in the BIOs to zero out bits of disk blocks. We're also currently requiring that the filesystem pass its backing block_device * into generic_direct_IO so we can run bdev_hardsectsize() at the right time, which I'm not 100% happy with. But relaxing the 4k alignment requirement has great value, so we'll persist with that. I'll include that in the next -mm; you may want to take a look at it. patch: If the alignment checks in generic_direct_IO() fail, we end up not forcing writeback of dirty pagecache pages, but we still run invalidate_inode_pages2(). The net result is that dirty pagecache gets incorrectly removed. I guess this will expose unwritten disk blocks. So move the sync up into generic_file_direct_IO(), where we perform the invalidation. So we know that pagecache and disk are in sync before we do anything else. fs/direct-io.c | 21 ++++++++++----------- 1 files changed, 10 insertions(+), 11 deletions(-) --- 2.5.40/fs/direct-io.c~direct-io-invalidation-fix Fri Oct 4 13:41:37 2002 +++ 2.5.40-akpm/fs/direct-io.c Fri Oct 4 13:41:37 2002 @@ -620,13 +620,11 @@ generic_direct_IO(int rw, struct inode * int seg; size_t size; unsigned long addr; - struct address_space *mapping = inode->i_mapping; unsigned blocksize_mask = (1 << inode->i_blkbits) - 1; ssize_t retval = -EINVAL; - if (offset & blocksize_mask) { + if (offset & blocksize_mask) goto out; - } /* Check the memory alignment. Blocks cannot straddle pages */ for (seg = 0; seg < nr_segs; seg++) { @@ -636,14 +634,6 @@ generic_direct_IO(int rw, struct inode * goto out; } - if (mapping->nrpages) { - retval = filemap_fdatawrite(mapping); - if (retval == 0) - retval = filemap_fdatawait(mapping); - if (retval) - goto out; - } - retval = direct_io_worker(rw, inode, iov, offset, nr_segs, get_blocks); out: return retval; @@ -656,8 +646,17 @@ generic_file_direct_IO(int rw, struct in struct address_space *mapping = inode->i_mapping; ssize_t retval; + if (mapping->nrpages) { + retval = filemap_fdatawrite(mapping); + if (retval == 0) + retval = filemap_fdatawait(mapping); + if (retval) + goto out; + } + retval = mapping->a_ops->direct_IO(rw, inode, iov, offset, nr_segs); if (inode->i_mapping->nrpages) invalidate_inode_pages2(inode->i_mapping); +out: return retval; } . ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2002-10-04 20:45 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-10-04 20:17 2.5 O)DIRECT problem Steve Lord 2002-10-04 20:29 ` Andrew Morton 2002-10-04 20:38 ` Steve Lord 2002-10-04 20:51 ` Andrew Morton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).