* LSF/MM/BPF 2023 IOMAP conversion status update

From: Luis Chamberlain @ 2023-01-29 4:46 UTC
To: lsf-pc, Christoph Hellwig, Matthew Wilcox, David Howells
Cc: Luis Chamberlain, Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm

One of the recurring themes that comes up at LSF is "iomap has little to no
documentation, it is hard to use". I've only recently taken a little nose dive
into it, so I can frankly admit I don't grok it well yet either. However, the
*general* motivation and value are clear: avoiding the old ugly monster of
struct buffer_head, and abstracting the page cache for non-network filesystems
(for network filesystems my understanding is that we have another side effort
covering that).

We could go a bit down memory lane on prior attempts to kill the struct
buffer_head demon in Linux, or why it's evil, but I'm not sure recapping that
is useful at this point in time; let me know, I could do that if it helps and
folks want to talk about this at LSF. For now I'd rather focus on sharing a
review of where we are today on the conversion towards IOMAP for some of the
major filesystems:

https://docs.google.com/presentation/d/e/2PACX-1vSN4TmhiTu1c6HNv6_gJZFqbFZpbF7GkABllSwJw5iLnSYKkkO-etQJ3AySYEbgJA/pub?start=true&loop=false&delayms=3000&slide=id.g189cfd05063_0_225

I'm hoping this *might* be useful to some, but I fear it may leave quite a few
folks with more questions than answers, as it did for me. And hence I figured
that *this aspect of this topic* perhaps might be a good topic for LSF. The
end goal would hopefully then be finally enabling us to document the IOMAP API
properly and helping with the whole conversion effort.

My takeaway from this quick review of API evolution and use is that XFS is
*certainly* a first-class-citizen user. No surprise there, given a lot of the
effort came out of XFS. And even though btrfs now avoids the evil struct
buffer_head monster, its use of the IOMAP API seems *dramatically* different
from XFS's, and it probably puzzles many. Is it that btrfs managed to just get
rid of struct buffer_head use but missed fully abstracting its work with the
page cache? How does one check? What semantics do we look for?

When looking to see whether one can help on the conversion front with other
filesystems, it begs the question of what the correct real end goal is. What
should one strive for? And it also gets me wondering: if we wanted to abstract
the page cache from scratch again, would we have done this a bit differently
now? Are there lessons from the network filesystem side of things which can be
shared? If so, perhaps this should instead be about why that's a good idea and
what it should look like.

Perhaps fs/buffers.c could be converted to folios only, and be done with it.
But would we be losing out on something? What would that be?

  Luis
* Re: LSF/MM/BPF 2023 IOMAP conversion status update

From: Matthew Wilcox @ 2023-01-29 5:06 UTC
To: Luis Chamberlain
Cc: lsf-pc, Christoph Hellwig, David Howells, Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm

On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
> I'm hoping this *might* be useful to some, but I fear it may leave quite a
> few folks with more questions than answers, as it did for me. And hence I
> figured that *this aspect of this topic* perhaps might be a good topic for
> LSF. The end goal would hopefully then be finally enabling us to document
> the IOMAP API properly and helping with the whole conversion effort.

+1 from me.

I've made a couple of abortive efforts to try and convert a "trivial"
filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on what
the semantics are for get_block_t and iomap_begin().

> Perhaps fs/buffers.c could be converted to folios only, and be done with it.
> But would we be losing out on something? What would that be?

buffer_heads are inefficient for multi-page folios because some of the
algorithms are O(n^2) for n being the number of buffers in a folio. It's fine
for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in a 2MB folio,
it's pretty sticky. Things like "Read I/O has completed on this buffer, can I
mark the folio as Uptodate now?" For iomap, that's a scan of a 64 byte bitmap
up to 512 times; for BHs, it's a loop over 512 allocations, looking at one bit
in each BH before moving on to the next. Similarly for writeback, iirc.

So +1 from me for a "How do we convert 35-ish block based filesystems from BHs
to iomap for their buffered & direct IO paths". There's maybe a separate
discussion to be had for "What should the API be for filesystems to access
metadata on the block device", because I don't believe the page-cache based
APIs are easy for fs authors to use.

Maybe some related topics are:
"What testing should we require for some of these ancient filesystems?"
"Whose job is it to convert these 35 filesystems anyway, can we just delete
some of them?"
"Is there a lower-performance but easier-to-implement API than iomap for old
filesystems that only exist for compatibility reasons?"
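[As a rough illustration of the asymmetry described above, here is a minimal
sketch of the two "is the whole folio uptodate?" checks. This is simplified
pseudo-kernel code: the two function names are made up for illustration, only
buffer_uptodate(), b_this_page and bitmap_full() are real kernel interfaces.]

/* buffer_head model: walk every BH attached to the folio, one allocation each */
static bool folio_bhs_uptodate(struct buffer_head *head)
{
	struct buffer_head *bh = head;

	do {
		if (!buffer_uptodate(bh))
			return false;
		bh = bh->b_this_page;
	} while (bh != head);
	return true;	/* and this walk runs once per completing buffer */
}

/* iomap model: one per-folio bitmap, one bit per block */
static bool folio_blocks_uptodate(const unsigned long *uptodate,
				  unsigned int blocks_per_folio)
{
	return bitmap_full(uptodate, blocks_per_folio);
}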
* Re: LSF/MM/BPF 2023 IOMAP conversion status update

From: Luis Chamberlain @ 2023-01-29 5:39 UTC
To: Matthew Wilcox
Cc: lsf-pc, Christoph Hellwig, David Howells, Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm

On Sun, Jan 29, 2023 at 05:06:47AM +0000, Matthew Wilcox wrote:
> There's maybe a separate discussion to be had for "What should the API be
> for filesystems to access metadata on the block device" because I don't
> believe the page-cache based APIs are easy for fs authors to use.

OK sure, and *why* that would be good, yes. Perhaps that should be discussed
first, though, as I think it may then be easier to celebrate IOMAP.

> Maybe some related topics are
> "What testing should we require for some of these ancient filesystems?"

Oh yes....

> "Whose job is it to convert these 35 filesystems anyway, can we just
> delete some of them?"
> "Is there a lower-performance but easier-to-implement API than iomap
> for old filesystems that only exist for compatibility reasons?"

Yes...

  Luis
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-01-29 5:06 ` Matthew Wilcox 2023-01-29 5:39 ` Luis Chamberlain @ 2023-02-08 16:04 ` Jan Kara 2023-02-24 7:01 ` Zhang Yi ` (2 more replies) 2023-02-27 19:47 ` Darrick J. Wong 2 siblings, 3 replies; 23+ messages in thread From: Jan Kara @ 2023-02-08 16:04 UTC (permalink / raw) To: Matthew Wilcox Cc: Luis Chamberlain, lsf-pc, Christoph Hellwig, David Howells, kbus @imap.suse.de>> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm On Sun 29-01-23 05:06:47, Matthew Wilcox wrote: > On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote: > > I'm hoping this *might* be useful to some, but I fear it may leave quite > > a bit of folks with more questions than answers as it did for me. And > > hence I figured that *this aspect of this topic* perhaps might be a good > > topic for LSF. The end goal would hopefully then be finally enabling us > > to document IOMAP API properly and helping with the whole conversion > > effort. > > +1 from me. > > I've made a couple of abortive efforts to try and convert a "trivial" > filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on > what the semantics are for get_block_t and iomap_begin(). Yeah, I'd be also interested in this discussion. In particular as a maintainer of part of these legacy filesystems (ext2, udf, isofs). > > Perhaps fs/buffers.c could be converted to folios only, and be done > > with it. But would we be loosing out on something? What would that be? > > buffer_heads are inefficient for multi-page folios because some of the > algorthims are O(n^2) for n being the number of buffers in a folio. > It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in > a 2MB folio, it's pretty sticky. Things like "Read I/O has completed on > this buffer, can I mark the folio as Uptodate now?" For iomap, that's a > scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512 > allocations, looking at one bit in each BH before moving on to the next. > Similarly for writeback, iirc. > > So +1 from me for a "How do we convert 35-ish block based filesystems > from BHs to iomap for their buffered & direct IO paths". There's maybe a > separate discussion to be had for "What should the API be for filesystems > to access metadata on the block device" because I don't believe the > page-cache based APIs are easy for fs authors to use. Yeah, so the actual data paths should be relatively easy for these old filesystems as they usually don't do anything special (those that do - like reiserfs - are deprecated and to be removed). But for metadata we do need some convenience functions like - give me block of metadata at this block number, make it dirty / clean / uptodate (block granularity dirtying & uptodate state is absolute must for metadata, otherwise we'll have data corruption issues). From the more complex functionality we need stuff like: lock particular block of metadata (equivalent of buffer lock), track that this block is metadata for given inode so that it can be written on fsync(2). Then more fancy filesystems like ext4 also need to attach more private state to each metadata block but that needs to be dealt with on case-by-case basis anyway. > Maybe some related topics are > "What testing should we require for some of these ancient filesystems?" > "Whose job is it to convert these 35 filesystems anyway, can we just > delete some of them?" I would not certainly miss some more filesystems - like minix, sysv, ... 
But before really threatening to remove some of these ancient and long
untouched filesystems, we should convert at least those we do care about. Once
there's a precedent for what a simple filesystem conversion looks like, it is
easier to argue about what to do with the ones we don't care about so much.

> "Is there a lower-performance but easier-to-implement API than iomap
> for old filesystems that only exist for compatibility reasons?"

As I wrote above, for metadata there ought to be something, as otherwise it
will be a real pain (and no gain really). But I guess the concrete API only
materializes once we attempt a conversion of some filesystem like ext2. I'll
try to have a look into that, at least the obvious preparatory steps like
converting the data paths to iomap.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
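[Just to make the shape of that concrete, the convenience API Jan describes
might look roughly like the declarations below. This is purely a strawman
sketch; every name in it is invented for illustration and nothing with these
names exists in the kernel today.]

/* hypothetical block-granular metadata buffer, a buffer_head replacement */
struct meta_buf;

/* "give me the block of metadata at this block number" (read + refcount) */
struct meta_buf *meta_get(struct block_device *bdev, sector_t block,
			  unsigned int size);
void meta_put(struct meta_buf *mb);

/* block-granularity state, needed to avoid corruption for bs < ps */
void meta_set_uptodate(struct meta_buf *mb);
void meta_mark_dirty(struct meta_buf *mb);
void meta_mark_clean(struct meta_buf *mb);

/* equivalent of lock_buffer()/unlock_buffer() */
void meta_lock(struct meta_buf *mb);
void meta_unlock(struct meta_buf *mb);

/* remember that this block backs the given inode, so fsync(2) can write it */
void meta_attach_to_inode(struct meta_buf *mb, struct inode *inode);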
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-02-08 16:04 ` Jan Kara @ 2023-02-24 7:01 ` Zhang Yi 2023-02-26 20:16 ` Ritesh Harjani 2023-02-27 19:26 ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong 2 siblings, 0 replies; 23+ messages in thread From: Zhang Yi @ 2023-02-24 7:01 UTC (permalink / raw) To: Luis Chamberlain, Jan Kara, Matthew Wilcox Cc: lsf-pc, Christoph Hellwig, David Howells, kbus @imap.suse.de>> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm, yi.zhang, guohanjun On 2023/2/9 0:04, Jan Kara wrote: > On Sun 29-01-23 05:06:47, Matthew Wilcox wrote: >> On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote: >>> I'm hoping this *might* be useful to some, but I fear it may leave quite >>> a bit of folks with more questions than answers as it did for me. And >>> hence I figured that *this aspect of this topic* perhaps might be a good >>> topic for LSF. The end goal would hopefully then be finally enabling us >>> to document IOMAP API properly and helping with the whole conversion >>> effort. >> >> +1 from me. >> >> I've made a couple of abortive efforts to try and convert a "trivial" >> filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on >> what the semantics are for get_block_t and iomap_begin(). > > Yeah, I'd be also interested in this discussion. In particular as a > maintainer of part of these legacy filesystems (ext2, udf, isofs). > >>> Perhaps fs/buffers.c could be converted to folios only, and be done >>> with it. But would we be loosing out on something? What would that be? >> >> buffer_heads are inefficient for multi-page folios because some of the >> algorthims are O(n^2) for n being the number of buffers in a folio. >> It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in >> a 2MB folio, it's pretty sticky. Things like "Read I/O has completed on >> this buffer, can I mark the folio as Uptodate now?" For iomap, that's a >> scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512 >> allocations, looking at one bit in each BH before moving on to the next. >> Similarly for writeback, iirc. >> >> So +1 from me for a "How do we convert 35-ish block based filesystems >> from BHs to iomap for their buffered & direct IO paths". There's maybe a >> separate discussion to be had for "What should the API be for filesystems >> to access metadata on the block device" because I don't believe the >> page-cache based APIs are easy for fs authors to use. > > Yeah, so the actual data paths should be relatively easy for these old > filesystems as they usually don't do anything special (those that do - like > reiserfs - are deprecated and to be removed). But for metadata we do need > some convenience functions like - give me block of metadata at this block > number, make it dirty / clean / uptodate (block granularity dirtying & > uptodate state is absolute must for metadata, otherwise we'll have data > corruption issues). From the more complex functionality we need stuff like: > lock particular block of metadata (equivalent of buffer lock), track that > this block is metadata for given inode so that it can be written on > fsync(2). Then more fancy filesystems like ext4 also need to attach more > private state to each metadata block but that needs to be dealt with on > case-by-case basis anyway. > Hello, all. I also interested in this topic, especially for the ext4 filesystem iomap conversion of buffered IO paths. 
And also for the discussion of the metadata APIs, current buffer_heads could lead to many potential problems and brings a lot of quality challenges to our products. I look forward to more discussion if I can attend offline. Thanks, Yi. >> Maybe some related topics are >> "What testing should we require for some of these ancient filesystems?" >> "Whose job is it to convert these 35 filesystems anyway, can we just >> delete some of them?" > > I would not certainly miss some more filesystems - like minix, sysv, ... > But before really treatening to remove some of these ancient and long > untouched filesystems, we should convert at least those we do care about. > When there's precedent how simple filesystem conversion looks like, it is > easier to argue about what to do with the ones we don't care about so much. > >> "Is there a lower-performance but easier-to-implement API than iomap >> for old filesystems that only exist for compatibiity reasons?" > > As I wrote above, for metadata there ought to be something as otherwise it > will be real pain (and no gain really). But I guess the concrete API only > matterializes once we attempt a conversion of some filesystem like ext2. > I'll try to have a look into that, at least the obvious preparatory steps > like converting the data paths to iomap. > > Honza > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-02-08 16:04 ` Jan Kara 2023-02-24 7:01 ` Zhang Yi @ 2023-02-26 20:16 ` Ritesh Harjani 2023-03-16 14:40 ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM) 2023-02-27 19:26 ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong 2 siblings, 1 reply; 23+ messages in thread From: Ritesh Harjani @ 2023-02-26 20:16 UTC (permalink / raw) To: Jan Kara, Matthew Wilcox Cc: Luis Chamberlain, lsf-pc, Christoph Hellwig, David Howells, kbus @imap.suse.de>> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm Jan Kara <jack@suse.cz> writes: > On Sun 29-01-23 05:06:47, Matthew Wilcox wrote: >> On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote: >> > I'm hoping this *might* be useful to some, but I fear it may leave quite >> > a bit of folks with more questions than answers as it did for me. And >> > hence I figured that *this aspect of this topic* perhaps might be a good >> > topic for LSF. The end goal would hopefully then be finally enabling us >> > to document IOMAP API properly and helping with the whole conversion >> > effort. >> >> +1 from me. +1 from my end as well please. Currently I have also been working on adding subpage size dirty tracking support to iomap layer so that we don't have the write amplification problem in case of buffered writes for bs < ps systems [1]. This also improves the performance including in some real world usecases like postgres db workload. Now there are some further points that we would like to discuss on how to optimize/improve iomap dirty bitmap tracking for large folios. I can try to come up with some ideas in that regard so that we can discuss about these as well with others [2] [1]: https://lore.kernel.org/all/cover.1677428794.git.ritesh.list@gmail.com/ [2]: https://lore.kernel.org/all/20230130210113.opdvyliooizicrsk@rh-tp/ >> >> I've made a couple of abortive efforts to try and convert a "trivial" >> filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on >> what the semantics are for get_block_t and iomap_begin(). > > Yeah, I'd be also interested in this discussion. In particular as a > maintainer of part of these legacy filesystems (ext2, udf, isofs). > >> > Perhaps fs/buffers.c could be converted to folios only, and be done >> > with it. But would we be loosing out on something? What would that be? >> >> buffer_heads are inefficient for multi-page folios because some of the >> algorthims are O(n^2) for n being the number of buffers in a folio. >> It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in >> a 2MB folio, it's pretty sticky. Things like "Read I/O has completed on >> this buffer, can I mark the folio as Uptodate now?" For iomap, that's a >> scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512 >> allocations, looking at one bit in each BH before moving on to the next. >> Similarly for writeback, iirc. >> >> So +1 from me for a "How do we convert 35-ish block based filesystems >> from BHs to iomap for their buffered & direct IO paths". There's maybe a >> separate discussion to be had for "What should the API be for filesystems >> to access metadata on the block device" because I don't believe the >> page-cache based APIs are easy for fs authors to use. > > Yeah, so the actual data paths should be relatively easy for these old > filesystems as they usually don't do anything special (those that do - like > reiserfs - are deprecated and to be removed). 
But for metadata we do need > some convenience functions like - give me block of metadata at this block > number, make it dirty / clean / uptodate (block granularity dirtying & > uptodate state is absolute must for metadata, otherwise we'll have data > corruption issues). From the more complex functionality we need stuff like: > lock particular block of metadata (equivalent of buffer lock), track that > this block is metadata for given inode so that it can be written on > fsync(2). Then more fancy filesystems like ext4 also need to attach more > private state to each metadata block but that needs to be dealt with on > case-by-case basis anyway. > >> Maybe some related topics are >> "What testing should we require for some of these ancient filesystems?" >> "Whose job is it to convert these 35 filesystems anyway, can we just >> delete some of them?" > > I would not certainly miss some more filesystems - like minix, sysv, ... > But before really treatening to remove some of these ancient and long > untouched filesystems, we should convert at least those we do care about. > When there's precedent how simple filesystem conversion looks like, it is > easier to argue about what to do with the ones we don't care about so much. > >> "Is there a lower-performance but easier-to-implement API than iomap >> for old filesystems that only exist for compatibiity reasons?" > > As I wrote above, for metadata there ought to be something as otherwise it > will be real pain (and no gain really). But I guess the concrete API only > matterializes once we attempt a conversion of some filesystem like ext2. > I'll try to have a look into that, at least the obvious preparatory steps > like converting the data paths to iomap. I have worked in past with Jan and others on adding iomap support to DIO for ext4. I have also added fiemap, bmap and swap operations to move to iomap for ext4 and I would like to continue working in this direction for making ext4 completely switch to iomap (including the buffered-io path). I would definitely like to convert ext2 to iomap (work along with you on this of course :)). This in my opinion too, can help us figure out whether iomap requires any changes so as to support ext* family of filesystem specially for buffered io path. This IMO is simpler then to start with ext4 buffered-io path which makes fscrypt, fsverity etc. get in our way... Let me also take a look at this and get back to you. At first I will get started with ext2 DIO path (which I think should be very straight forward, since we already have iomap_ops for DAX in ext2). While at it, I will also see what is required for the buffered I/O path conversion. [3]: https://lore.kernel.org/all/?q=s%3Aiomap+and+f%3Aritesh -ritesh ^ permalink raw reply [flat|nested] 23+ messages in thread
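[For context on the sub-page dirty tracking mentioned above: the idea is
essentially to keep a per-block dirty bitmap next to iomap's existing
per-block uptodate bitmap, so writeback only submits the blocks that actually
changed. A conceptual sketch only, not the posted patches; the struct and
field names are illustrative.]

/* one of these would hang off folio->private for bs < ps */
struct folio_block_state {
	spinlock_t	state_lock;
	/*
	 * First blocks_per_folio bits: uptodate (what iomap tracks today).
	 * Next blocks_per_folio bits: dirty (the proposed addition), so a
	 * 4k write into a 2MB folio dirties one bit instead of the whole
	 * folio, avoiding the write amplification on writeback.
	 */
	unsigned long	state[];
};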
* [RFCv1][WIP] ext2: Move direct-io to use iomap 2023-02-26 20:16 ` Ritesh Harjani @ 2023-03-16 14:40 ` Ritesh Harjani (IBM) 2023-03-16 15:41 ` Darrick J. Wong ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Ritesh Harjani (IBM) @ 2023-03-16 14:40 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, linux-mm, lsf-pc, Ritesh Harjani (IBM) [DO NOT MERGE] [WORK-IN-PROGRESS] Hello Jan, This is an initial version of the patch set which I wanted to share before today's call. This is still work in progress but atleast passes the set of test cases which I had kept for dio testing (except 1 from my list). Looks like there won't be much/any changes required from iomap side to support ext2 moving to iomap apis. I will be doing some more testing specifically test generic/083 which is occassionally failing in my testing. Also once this is stabilized, I can do some performance testing too if you feel so. Last I remembered we saw some performance regressions when ext4 moved to iomap for dio. PS: Please ignore if there are some silly mistakes. As I said, I wanted to get this out before today's discussion. :) Thanks for your help!! Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> --- fs/ext2/ext2.h | 1 + fs/ext2/file.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext2/inode.c | 20 +-------- 3 files changed, 117 insertions(+), 18 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index cb78d7dcfb95..cb5e309fe040 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned); extern struct inode *ext2_iget (struct super_block *, unsigned long); extern int ext2_write_inode (struct inode *, struct writeback_control *); extern void ext2_evict_inode(struct inode *); +extern void ext2_write_failed(struct address_space *mapping, loff_t to); extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int); extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *); extern int ext2_getattr (struct mnt_idmap *, const struct path *, diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 6b4bebe982ca..7a8561304559 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) return ret; } +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + struct file *file = iocb->ki_filp; + struct inode *inode = file->f_mapping->host; + ssize_t ret; + + inode_lock_shared(inode); + ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0); + inode_unlock_shared(inode); + + return ret; +} + +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size, + int error, unsigned int flags) +{ + loff_t pos = iocb->ki_pos; + struct inode *inode = file_inode(iocb->ki_filp); + + if (error) + return error; + + pos += size; + if (pos > i_size_read(inode)) + i_size_write(inode, pos); + + return 0; +} + +static const struct iomap_dio_ops ext2_dio_write_ops = { + .end_io = ext2_dio_write_end_io, +}; + +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) +{ + struct file *file = iocb->ki_filp; + struct inode *inode = file->f_mapping->host; + ssize_t ret; + unsigned int flags; + unsigned long blocksize = inode->i_sb->s_blocksize; + loff_t offset = iocb->ki_pos; + loff_t count = iov_iter_count(from); + + + inode_lock(inode); + ret = generic_write_checks(iocb, from); + if (ret <= 0) + goto out_unlock; + ret = file_remove_privs(file); + if (ret) + 
goto out_unlock; + ret = file_update_time(file); + if (ret) + goto out_unlock; + + /* + * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw() + * calls for generic_write_sync in iomap_dio_complete(). + * Since ext2_fsync nmust be called w/o inode lock, + * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync() + * ourselves. + */ + flags = IOMAP_DIO_NOSYNC; + + /* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */ + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) || + (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize))) + flags |= IOMAP_DIO_FORCE_WAIT; + + ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops, + flags, NULL, 0); + + if (ret == -ENOTBLK) + ret = 0; + + if (ret < 0 && ret != -EIOCBQUEUED) + ext2_write_failed(inode->i_mapping, offset + count); + + /* handle case for partial write or fallback to buffered write */ + if (ret >= 0 && iov_iter_count(from)) { + loff_t pos, endbyte; + ssize_t status; + ssize_t ret2; + + pos = iocb->ki_pos; + status = generic_perform_write(iocb, from); + if (unlikely(status < 0)) { + ret = status; + goto out_unlock; + } + endbyte = pos + status - 1; + ret2 = filemap_write_and_wait_range(inode->i_mapping, pos, + endbyte); + if (ret2 == 0) { + iocb->ki_pos = endbyte + 1; + ret += status; + invalidate_mapping_pages(inode->i_mapping, + pos >> PAGE_SHIFT, + endbyte >> PAGE_SHIFT); + } + } +out_unlock: + inode_unlock(inode); + if (ret > 0) + ret = generic_write_sync(iocb, ret); + return ret; +} + static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) { #ifdef CONFIG_FS_DAX if (IS_DAX(iocb->ki_filp->f_mapping->host)) return ext2_dax_read_iter(iocb, to); #endif + if (iocb->ki_flags & IOCB_DIRECT) + return ext2_dio_read_iter(iocb, to); + return generic_file_read_iter(iocb, to); } @@ -176,6 +287,9 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from) if (IS_DAX(iocb->ki_filp->f_mapping->host)) return ext2_dax_write_iter(iocb, from); #endif + if (iocb->ki_flags & IOCB_DIRECT) + return ext2_dio_write_iter(iocb, from); + return generic_file_write_iter(iocb, from); } diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 26f135e7ffce..7ff669d0b6d2 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -56,7 +56,7 @@ static inline int ext2_inode_is_fast_symlink(struct inode *inode) static void ext2_truncate_blocks(struct inode *inode, loff_t offset); -static void ext2_write_failed(struct address_space *mapping, loff_t to) +void ext2_write_failed(struct address_space *mapping, loff_t to) { struct inode *inode = mapping->host; @@ -908,22 +908,6 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block) return generic_block_bmap(mapping,block,ext2_get_block); } -static ssize_t -ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter) -{ - struct file *file = iocb->ki_filp; - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; - size_t count = iov_iter_count(iter); - loff_t offset = iocb->ki_pos; - ssize_t ret; - - ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block); - if (ret < 0 && iov_iter_rw(iter) == WRITE) - ext2_write_failed(mapping, offset + count); - return ret; -} - static int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) { @@ -946,7 +930,7 @@ const struct address_space_operations ext2_aops = { .write_begin = ext2_write_begin, .write_end = ext2_write_end, .bmap = ext2_bmap, - .direct_IO = ext2_direct_IO, + .direct_IO = noop_direct_IO, .writepages = 
ext2_writepages, .migrate_folio = buffer_migrate_folio, .is_partially_uptodate = block_is_partially_uptodate, -- 2.39.2
* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap 2023-03-16 14:40 ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM) @ 2023-03-16 15:41 ` Darrick J. Wong 2023-03-20 16:11 ` Ritesh Harjani 2023-03-20 13:15 ` Christoph Hellwig 2023-03-20 17:51 ` Jan Kara 2 siblings, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2023-03-16 15:41 UTC (permalink / raw) To: Ritesh Harjani (IBM); +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc On Thu, Mar 16, 2023 at 08:10:29PM +0530, Ritesh Harjani (IBM) wrote: > [DO NOT MERGE] [WORK-IN-PROGRESS] > > Hello Jan, > > This is an initial version of the patch set which I wanted to share > before today's call. This is still work in progress but atleast passes > the set of test cases which I had kept for dio testing (except 1 from my > list). > > Looks like there won't be much/any changes required from iomap side to > support ext2 moving to iomap apis. > > I will be doing some more testing specifically test generic/083 which is > occassionally failing in my testing. > Also once this is stabilized, I can do some performance testing too if you > feel so. Last I remembered we saw some performance regressions when ext4 > moved to iomap for dio. > > PS: Please ignore if there are some silly mistakes. As I said, I wanted > to get this out before today's discussion. :) > > Thanks for your help!! > > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> > --- > fs/ext2/ext2.h | 1 + > fs/ext2/file.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++ > fs/ext2/inode.c | 20 +-------- > 3 files changed, 117 insertions(+), 18 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index cb78d7dcfb95..cb5e309fe040 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned); > extern struct inode *ext2_iget (struct super_block *, unsigned long); > extern int ext2_write_inode (struct inode *, struct writeback_control *); > extern void ext2_evict_inode(struct inode *); > +extern void ext2_write_failed(struct address_space *mapping, loff_t to); > extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int); > extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *); > extern int ext2_getattr (struct mnt_idmap *, const struct path *, > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 6b4bebe982ca..7a8561304559 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) > return ret; > } > > +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to) > +{ > + struct file *file = iocb->ki_filp; > + struct inode *inode = file->f_mapping->host; > + ssize_t ret; > + > + inode_lock_shared(inode); > + ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0); > + inode_unlock_shared(inode); > + > + return ret; > +} > + > +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size, > + int error, unsigned int flags) > +{ > + loff_t pos = iocb->ki_pos; > + struct inode *inode = file_inode(iocb->ki_filp); > + > + if (error) > + return error; > + > + pos += size; > + if (pos > i_size_read(inode)) > + i_size_write(inode, pos); > + > + return 0; > +} > + > +static const struct iomap_dio_ops ext2_dio_write_ops = { > + .end_io = ext2_dio_write_end_io, > +}; > + > +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) > +{ > + struct file *file = iocb->ki_filp; > + struct inode *inode = 
file->f_mapping->host; > + ssize_t ret; > + unsigned int flags; > + unsigned long blocksize = inode->i_sb->s_blocksize; > + loff_t offset = iocb->ki_pos; > + loff_t count = iov_iter_count(from); > + > + > + inode_lock(inode); > + ret = generic_write_checks(iocb, from); > + if (ret <= 0) > + goto out_unlock; > + ret = file_remove_privs(file); > + if (ret) > + goto out_unlock; > + ret = file_update_time(file); > + if (ret) > + goto out_unlock; kiocb_modified() instead of calling file_remove_privs? > + > + /* > + * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw() > + * calls for generic_write_sync in iomap_dio_complete(). > + * Since ext2_fsync nmust be called w/o inode lock, > + * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync() > + * ourselves. > + */ > + flags = IOMAP_DIO_NOSYNC; > + > + /* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */ > + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) || > + (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize))) > + flags |= IOMAP_DIO_FORCE_WAIT; > + > + ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops, > + flags, NULL, 0); > + > + if (ret == -ENOTBLK) > + ret = 0; > + > + if (ret < 0 && ret != -EIOCBQUEUED) > + ext2_write_failed(inode->i_mapping, offset + count); > + > + /* handle case for partial write or fallback to buffered write */ > + if (ret >= 0 && iov_iter_count(from)) { > + loff_t pos, endbyte; > + ssize_t status; > + ssize_t ret2; > + > + pos = iocb->ki_pos; > + status = generic_perform_write(iocb, from); > + if (unlikely(status < 0)) { > + ret = status; > + goto out_unlock; > + } > + endbyte = pos + status - 1; > + ret2 = filemap_write_and_wait_range(inode->i_mapping, pos, > + endbyte); > + if (ret2 == 0) { > + iocb->ki_pos = endbyte + 1; > + ret += status; > + invalidate_mapping_pages(inode->i_mapping, > + pos >> PAGE_SHIFT, > + endbyte >> PAGE_SHIFT); > + } > + } (Why not fall back to the actual buffered write path?) Otherwise this looks like a reasonable first start. 
--D > +out_unlock: > + inode_unlock(inode); > + if (ret > 0) > + ret = generic_write_sync(iocb, ret); > + return ret; > +} > + > static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) > { > #ifdef CONFIG_FS_DAX > if (IS_DAX(iocb->ki_filp->f_mapping->host)) > return ext2_dax_read_iter(iocb, to); > #endif > + if (iocb->ki_flags & IOCB_DIRECT) > + return ext2_dio_read_iter(iocb, to); > + > return generic_file_read_iter(iocb, to); > } > > @@ -176,6 +287,9 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from) > if (IS_DAX(iocb->ki_filp->f_mapping->host)) > return ext2_dax_write_iter(iocb, from); > #endif > + if (iocb->ki_flags & IOCB_DIRECT) > + return ext2_dio_write_iter(iocb, from); > + > return generic_file_write_iter(iocb, from); > } > > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 26f135e7ffce..7ff669d0b6d2 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -56,7 +56,7 @@ static inline int ext2_inode_is_fast_symlink(struct inode *inode) > > static void ext2_truncate_blocks(struct inode *inode, loff_t offset); > > -static void ext2_write_failed(struct address_space *mapping, loff_t to) > +void ext2_write_failed(struct address_space *mapping, loff_t to) > { > struct inode *inode = mapping->host; > > @@ -908,22 +908,6 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block) > return generic_block_bmap(mapping,block,ext2_get_block); > } > > -static ssize_t > -ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > -{ > - struct file *file = iocb->ki_filp; > - struct address_space *mapping = file->f_mapping; > - struct inode *inode = mapping->host; > - size_t count = iov_iter_count(iter); > - loff_t offset = iocb->ki_pos; > - ssize_t ret; > - > - ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block); > - if (ret < 0 && iov_iter_rw(iter) == WRITE) > - ext2_write_failed(mapping, offset + count); > - return ret; > -} > - > static int > ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) > { > @@ -946,7 +930,7 @@ const struct address_space_operations ext2_aops = { > .write_begin = ext2_write_begin, > .write_end = ext2_write_end, > .bmap = ext2_bmap, > - .direct_IO = ext2_direct_IO, > + .direct_IO = noop_direct_IO, > .writepages = ext2_writepages, > .migrate_folio = buffer_migrate_folio, > .is_partially_uptodate = block_is_partially_uptodate, > -- > 2.39.2 > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap 2023-03-16 15:41 ` Darrick J. Wong @ 2023-03-20 16:11 ` Ritesh Harjani 0 siblings, 0 replies; 23+ messages in thread From: Ritesh Harjani @ 2023-03-20 16:11 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc "Darrick J. Wong" <djwong@kernel.org> writes: > On Thu, Mar 16, 2023 at 08:10:29PM +0530, Ritesh Harjani (IBM) wrote: >> [DO NOT MERGE] [WORK-IN-PROGRESS] >> >> Hello Jan, >> >> This is an initial version of the patch set which I wanted to share >> before today's call. This is still work in progress but atleast passes >> the set of test cases which I had kept for dio testing (except 1 from my >> list). >> >> Looks like there won't be much/any changes required from iomap side to >> support ext2 moving to iomap apis. >> >> I will be doing some more testing specifically test generic/083 which is >> occassionally failing in my testing. >> Also once this is stabilized, I can do some performance testing too if you >> feel so. Last I remembered we saw some performance regressions when ext4 >> moved to iomap for dio. >> >> PS: Please ignore if there are some silly mistakes. As I said, I wanted >> to get this out before today's discussion. :) >> >> Thanks for your help!! >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> >> --- >> fs/ext2/ext2.h | 1 + >> fs/ext2/file.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++ >> fs/ext2/inode.c | 20 +-------- >> 3 files changed, 117 insertions(+), 18 deletions(-) >> >> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h >> index cb78d7dcfb95..cb5e309fe040 100644 >> --- a/fs/ext2/ext2.h >> +++ b/fs/ext2/ext2.h >> @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned); >> extern struct inode *ext2_iget (struct super_block *, unsigned long); >> extern int ext2_write_inode (struct inode *, struct writeback_control *); >> extern void ext2_evict_inode(struct inode *); >> +extern void ext2_write_failed(struct address_space *mapping, loff_t to); >> extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int); >> extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *); >> extern int ext2_getattr (struct mnt_idmap *, const struct path *, >> diff --git a/fs/ext2/file.c b/fs/ext2/file.c >> index 6b4bebe982ca..7a8561304559 100644 >> --- a/fs/ext2/file.c >> +++ b/fs/ext2/file.c >> @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) >> return ret; >> } >> >> +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to) >> +{ >> + struct file *file = iocb->ki_filp; >> + struct inode *inode = file->f_mapping->host; >> + ssize_t ret; >> + >> + inode_lock_shared(inode); >> + ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0); >> + inode_unlock_shared(inode); >> + >> + return ret; >> +} >> + >> +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size, >> + int error, unsigned int flags) >> +{ >> + loff_t pos = iocb->ki_pos; >> + struct inode *inode = file_inode(iocb->ki_filp); >> + >> + if (error) >> + return error; >> + >> + pos += size; >> + if (pos > i_size_read(inode)) >> + i_size_write(inode, pos); >> + >> + return 0; >> +} >> + >> +static const struct iomap_dio_ops ext2_dio_write_ops = { >> + .end_io = ext2_dio_write_end_io, >> +}; >> + >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) >> +{ >> + struct file *file = iocb->ki_filp; >> + struct inode *inode = 
file->f_mapping->host; >> + ssize_t ret; >> + unsigned int flags; >> + unsigned long blocksize = inode->i_sb->s_blocksize; >> + loff_t offset = iocb->ki_pos; >> + loff_t count = iov_iter_count(from); >> + >> + >> + inode_lock(inode); >> + ret = generic_write_checks(iocb, from); >> + if (ret <= 0) >> + goto out_unlock; >> + ret = file_remove_privs(file); >> + if (ret) >> + goto out_unlock; >> + ret = file_update_time(file); >> + if (ret) >> + goto out_unlock; > > kiocb_modified() instead of calling file_remove_privs? Yes, looks likle it is a replacement for file_remove_privs and file_update_time(). >> + >> + /* >> + * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw() >> + * calls for generic_write_sync in iomap_dio_complete(). >> + * Since ext2_fsync nmust be called w/o inode lock, >> + * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync() >> + * ourselves. >> + */ >> + flags = IOMAP_DIO_NOSYNC; >> + >> + /* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */ >> + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) || >> + (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize))) >> + flags |= IOMAP_DIO_FORCE_WAIT; >> + >> + ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops, >> + flags, NULL, 0); >> + >> + if (ret == -ENOTBLK) >> + ret = 0; >> + >> + if (ret < 0 && ret != -EIOCBQUEUED) >> + ext2_write_failed(inode->i_mapping, offset + count); >> + >> + /* handle case for partial write or fallback to buffered write */ >> + if (ret >= 0 && iov_iter_count(from)) { >> + loff_t pos, endbyte; >> + ssize_t status; >> + ssize_t ret2; >> + >> + pos = iocb->ki_pos; >> + status = generic_perform_write(iocb, from); >> + if (unlikely(status < 0)) { >> + ret = status; >> + goto out_unlock; >> + } >> + endbyte = pos + status - 1; >> + ret2 = filemap_write_and_wait_range(inode->i_mapping, pos, >> + endbyte); >> + if (ret2 == 0) { >> + iocb->ki_pos = endbyte + 1; >> + ret += status; >> + invalidate_mapping_pages(inode->i_mapping, >> + pos >> PAGE_SHIFT, >> + endbyte >> PAGE_SHIFT); >> + } >> + } > > (Why not fall back to the actual buffered write path?) > Because then we can handle everything related to DIO in ext4_dio_file_write() itself e.g. As per the semantics of DIO we should ensure that page-cache pages are written to disk and invalidated before returning (filemap_write_and_wait_range() and invalidate_mapping_pages()) > Otherwise this looks like a reasonable first start. Thanks! -ritesh ^ permalink raw reply [flat|nested] 23+ messages in thread
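[For reference, with that change the write prologue in the patch would shrink
to something like the sketch below, assuming kiocb_modified() as Darrick
suggests; it folds the privilege drop and timestamp update into one call that
also honours IOCB_NOWAIT. Sketch only, not the final patch.]

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret <= 0)
		goto out_unlock;
	/* replaces the separate file_remove_privs() + file_update_time() calls */
	ret = kiocb_modified(iocb);
	if (ret)
		goto out_unlock;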
* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap

From: Christoph Hellwig @ 2023-03-20 13:15 UTC
To: Ritesh Harjani (IBM)
Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc

On Thu, Mar 16, 2023 at 08:10:29PM +0530, Ritesh Harjani (IBM) wrote:
> +extern void ext2_write_failed(struct address_space *mapping, loff_t to);

Nit: please don't bother with externs for function prototypes.

> +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> +		int error, unsigned int flags)
> +{
> +	loff_t pos = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error)
> +		return error;
> +
> +	pos += size;
> +	if (pos > i_size_read(inode))
> +		i_size_write(inode, pos);

Doesn't i_size_write need i_mutex protection?

> +	/*
> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
> +	 * calls for generic_write_sync in iomap_dio_complete().
> +	 * Since ext2_fsync nmust be called w/o inode lock,
> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
> +	 * ourselves.
> +	 */
> +	flags = IOMAP_DIO_NOSYNC;

So we added IOMAP_DIO_NOSYNC for btrfs initially, but even btrfs did not
manage to make it work. I suspect the right thing here as well is to call
__iomap_dio_rw and then complete the I/O separately after dropping the lock.
Without that you for example can't take i_mutex in the end_io handler, which
I think we need to do. We should then remove IOMAP_DIO_NOSYNC again.
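[The split Christoph describes would look roughly like this inside
ext2_dio_write_iter(), reusing the ops names from Ritesh's patch; a sketch
only, with the surrounding error handling and fallback path elided.]

	struct iomap_dio *dio;

	dio = __iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
			     flags, NULL, 0);
	inode_unlock(inode);

	if (IS_ERR_OR_NULL(dio))
		ret = PTR_ERR_OR_ZERO(dio);
	else
		/*
		 * Completes the I/O and, without IOMAP_DIO_NOSYNC, also does
		 * generic_write_sync(), now with the inode lock already dropped.
		 */
		ret = iomap_dio_complete(dio);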
* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap 2023-03-16 14:40 ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM) 2023-03-16 15:41 ` Darrick J. Wong 2023-03-20 13:15 ` Christoph Hellwig @ 2023-03-20 17:51 ` Jan Kara 2023-03-22 6:34 ` Ritesh Harjani 2 siblings, 1 reply; 23+ messages in thread From: Jan Kara @ 2023-03-20 17:51 UTC (permalink / raw) To: Ritesh Harjani (IBM); +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc On Thu 16-03-23 20:10:29, Ritesh Harjani (IBM) wrote: > [DO NOT MERGE] [WORK-IN-PROGRESS] > > Hello Jan, > > This is an initial version of the patch set which I wanted to share > before today's call. This is still work in progress but atleast passes > the set of test cases which I had kept for dio testing (except 1 from my > list). > > Looks like there won't be much/any changes required from iomap side to > support ext2 moving to iomap apis. > > I will be doing some more testing specifically test generic/083 which is > occassionally failing in my testing. > Also once this is stabilized, I can do some performance testing too if you > feel so. Last I remembered we saw some performance regressions when ext4 > moved to iomap for dio. > > PS: Please ignore if there are some silly mistakes. As I said, I wanted > to get this out before today's discussion. :) > > Thanks for your help!! > > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> > --- > fs/ext2/ext2.h | 1 + > fs/ext2/file.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++ > fs/ext2/inode.c | 20 +-------- > 3 files changed, 117 insertions(+), 18 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index cb78d7dcfb95..cb5e309fe040 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned); > extern struct inode *ext2_iget (struct super_block *, unsigned long); > extern int ext2_write_inode (struct inode *, struct writeback_control *); > extern void ext2_evict_inode(struct inode *); > +extern void ext2_write_failed(struct address_space *mapping, loff_t to); > extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int); > extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *); > extern int ext2_getattr (struct mnt_idmap *, const struct path *, > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 6b4bebe982ca..7a8561304559 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) > return ret; > } > > +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to) > +{ > + struct file *file = iocb->ki_filp; > + struct inode *inode = file->f_mapping->host; > + ssize_t ret; > + > + inode_lock_shared(inode); > + ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0); > + inode_unlock_shared(inode); > + > + return ret; > +} > + > +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size, > + int error, unsigned int flags) > +{ > + loff_t pos = iocb->ki_pos; > + struct inode *inode = file_inode(iocb->ki_filp); > + > + if (error) > + return error; > + I guess you should carry over here relevant bits of the comment from ext4_dio_write_end_io() explaining that doing i_size update here is necessary and actually safe. 
> + pos += size; > + if (pos > i_size_read(inode)) > + i_size_write(inode, pos); > + > + return 0; > +} > + > +static const struct iomap_dio_ops ext2_dio_write_ops = { > + .end_io = ext2_dio_write_end_io, > +}; > + > +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) > +{ > + struct file *file = iocb->ki_filp; > + struct inode *inode = file->f_mapping->host; > + ssize_t ret; > + unsigned int flags; > + unsigned long blocksize = inode->i_sb->s_blocksize; > + loff_t offset = iocb->ki_pos; > + loff_t count = iov_iter_count(from); > + > + > + inode_lock(inode); > + ret = generic_write_checks(iocb, from); > + if (ret <= 0) > + goto out_unlock; > + ret = file_remove_privs(file); > + if (ret) > + goto out_unlock; > + ret = file_update_time(file); > + if (ret) > + goto out_unlock; > + > + /* > + * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw() > + * calls for generic_write_sync in iomap_dio_complete(). > + * Since ext2_fsync nmust be called w/o inode lock, > + * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync() > + * ourselves. > + */ > + flags = IOMAP_DIO_NOSYNC; Meh, this is kind of ugly and we should come up with something better for simple filesystems so that they don't have to play these games. Frankly, these days I doubt there's anybody really needing inode_lock in __generic_file_fsync(). Neither sync_mapping_buffers() nor sync_inode_metadata() need inode_lock for their self-consistency. So it is only about flushing more consistent set of metadata to disk when fsync(2) races with other write(2)s to the same file so after a crash we have higher chances of seeing some real state of the file. But I'm not sure it's really worth keeping for filesystems that are still using sync_mapping_buffers(). People that care about consistency after a crash have IMHO moved to other filesystems long ago. > + > + /* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */ ^^ or > + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) || > + (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize))) > + flags |= IOMAP_DIO_FORCE_WAIT; > + > + ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops, > + flags, NULL, 0); > + > + if (ret == -ENOTBLK) > + ret = 0; So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of blockdev_direct_IO(). Thus you have to implement that in your ext2_iomap_ops, in particular in iomap_begin... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
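[In other words, something equivalent to DIO_SKIP_HOLES has to live in ext2's
->iomap_begin. A hedged sketch of just that check; the "mapped" flag is a
placeholder for whatever ext2_get_blocks() reports for the range.]

static bool ext2_dio_write_needs_fallback(struct inode *inode, loff_t offset,
					  unsigned int flags, bool mapped)
{
	/*
	 * Filling a hole inside i_size via DIO could expose stale data after
	 * a crash, since the indirect tree is updated before the data lands.
	 * blockdev_direct_IO() avoided this with DIO_SKIP_HOLES; with iomap,
	 * returning -ENOTBLK from ->iomap_begin makes iomap_dio_rw() bail so
	 * the caller can fall back to buffered I/O.
	 */
	return (flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT) &&
	       !mapped && offset < i_size_read(inode);
}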
* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap 2023-03-20 17:51 ` Jan Kara @ 2023-03-22 6:34 ` Ritesh Harjani 2023-03-23 11:30 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: Ritesh Harjani @ 2023-03-22 6:34 UTC (permalink / raw) To: Jan Kara; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc Thanks Jan for your feedback! Jan Kara <jack@suse.cz> writes: > On Thu 16-03-23 20:10:29, Ritesh Harjani (IBM) wrote: >> [DO NOT MERGE] [WORK-IN-PROGRESS] >> >> Hello Jan, >> >> This is an initial version of the patch set which I wanted to share >> before today's call. This is still work in progress but atleast passes >> the set of test cases which I had kept for dio testing (except 1 from my >> list). >> >> Looks like there won't be much/any changes required from iomap side to >> support ext2 moving to iomap apis. >> >> I will be doing some more testing specifically test generic/083 which is >> occassionally failing in my testing. >> Also once this is stabilized, I can do some performance testing too if you >> feel so. Last I remembered we saw some performance regressions when ext4 >> moved to iomap for dio. >> >> PS: Please ignore if there are some silly mistakes. As I said, I wanted >> to get this out before today's discussion. :) >> >> Thanks for your help!! >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> >> --- >> fs/ext2/ext2.h | 1 + >> fs/ext2/file.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++ >> fs/ext2/inode.c | 20 +-------- >> 3 files changed, 117 insertions(+), 18 deletions(-) >> >> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h >> index cb78d7dcfb95..cb5e309fe040 100644 >> --- a/fs/ext2/ext2.h >> +++ b/fs/ext2/ext2.h >> @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned); >> extern struct inode *ext2_iget (struct super_block *, unsigned long); >> extern int ext2_write_inode (struct inode *, struct writeback_control *); >> extern void ext2_evict_inode(struct inode *); >> +extern void ext2_write_failed(struct address_space *mapping, loff_t to); >> extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int); >> extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *); >> extern int ext2_getattr (struct mnt_idmap *, const struct path *, >> diff --git a/fs/ext2/file.c b/fs/ext2/file.c >> index 6b4bebe982ca..7a8561304559 100644 >> --- a/fs/ext2/file.c >> +++ b/fs/ext2/file.c >> @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) >> return ret; >> } >> >> +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to) >> +{ >> + struct file *file = iocb->ki_filp; >> + struct inode *inode = file->f_mapping->host; >> + ssize_t ret; >> + >> + inode_lock_shared(inode); >> + ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0); >> + inode_unlock_shared(inode); >> + >> + return ret; >> +} >> + >> +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size, >> + int error, unsigned int flags) >> +{ >> + loff_t pos = iocb->ki_pos; >> + struct inode *inode = file_inode(iocb->ki_filp); >> + >> + if (error) >> + return error; >> + > > I guess you should carry over here relevant bits of the comment from > ext4_dio_write_end_io() explaining that doing i_size update here is > necessary and actually safe. > Yes, sure. Will do that in the next rev. 
>> + pos += size; >> + if (pos > i_size_read(inode)) >> + i_size_write(inode, pos); >> + >> + return 0; >> +} >> + >> +static const struct iomap_dio_ops ext2_dio_write_ops = { >> + .end_io = ext2_dio_write_end_io, >> +}; >> + >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) >> +{ >> + struct file *file = iocb->ki_filp; >> + struct inode *inode = file->f_mapping->host; >> + ssize_t ret; >> + unsigned int flags; >> + unsigned long blocksize = inode->i_sb->s_blocksize; >> + loff_t offset = iocb->ki_pos; >> + loff_t count = iov_iter_count(from); >> + >> + >> + inode_lock(inode); >> + ret = generic_write_checks(iocb, from); >> + if (ret <= 0) >> + goto out_unlock; >> + ret = file_remove_privs(file); >> + if (ret) >> + goto out_unlock; >> + ret = file_update_time(file); >> + if (ret) >> + goto out_unlock; >> + >> + /* >> + * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw() >> + * calls for generic_write_sync in iomap_dio_complete(). >> + * Since ext2_fsync nmust be called w/o inode lock, >> + * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync() >> + * ourselves. >> + */ >> + flags = IOMAP_DIO_NOSYNC; > > Meh, this is kind of ugly and we should come up with something better for > simple filesystems so that they don't have to play these games. Frankly, > these days I doubt there's anybody really needing inode_lock in > __generic_file_fsync(). Neither sync_mapping_buffers() nor > sync_inode_metadata() need inode_lock for their self-consistency. So it is > only about flushing more consistent set of metadata to disk when fsync(2) > races with other write(2)s to the same file so after a crash we have higher > chances of seeing some real state of the file. But I'm not sure it's really > worth keeping for filesystems that are still using sync_mapping_buffers(). > People that care about consistency after a crash have IMHO moved to other > filesystems long ago. > One way which hch is suggesting is to use __iomap_dio_rw() -> unlock inode -> call generic_write_sync(). I haven't yet worked on this part. Are you suggesting to rip of inode_lock from __generic_file_fsync()? Won't it have a much larger implications? >> + >> + /* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */ > ^^ or > Yup. will fix it. >> + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) || >> + (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize))) >> + flags |= IOMAP_DIO_FORCE_WAIT; >> + >> + ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops, >> + flags, NULL, 0); >> + >> + if (ret == -ENOTBLK) >> + ret = 0; > > So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of > blockdev_direct_IO(). Thus you have to implement that in your > ext2_iomap_ops, in particular in iomap_begin... > Aah yes. Thanks for pointing that out - ext2_iomap_begin() should have something like this - /* * We cannot fill holes in indirect tree based inodes as that could * expose stale data in the case of a crash. Use the magic error code * to fallback to buffered I/O. */ Also I think ext2_iomap_end() should also handle a case like in ext4 - /* * Check to see whether an error occurred while writing out the data to * the allocated blocks. If so, return the magic error code so that we * fallback to buffered I/O and attempt to complete the remainder of * the I/O. Any blocks that may have been allocated in preparation for * the direct I/O will be reused during buffered I/O. 
 */
if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
	return -ENOTBLK;

I am wondering whether we have test cases in xfstests which really exercise these functionalities or not? Let me give it a try...
... So I did, and somehow couldn't find any test case which fails w/o the above changes. I also ran the LTP dio tests (./runltp -f dio -d /mnt1/scratch/tmp/) with the current patches on ext2 and all the tests passed. So it seems LTP doesn't have tests to trigger stale data exposure problems. Do you know of any test case which could trigger this? Otherwise I can finish updating the patches and then work on writing some test cases to trigger this.

Another query -

We have this function ext2_iomap_end() (pasted below) which calls ext2_write_failed().

Here IMO two cases are possible -

1. written is 0, which means an error has occurred. In that case calling ext2_write_failed() makes sense.

2. Consider a case where written > 0 && written < length. (This is possible, right?) In that case we still go and call ext2_write_failed(). This function will truncate the pagecache and disk blocks beyond i_size. Now we haven't yet updated inode->i_size (we do that in ->end_io, which gets called at the end during completion), so that means it just removes everything. Then in ext2_dax_write_iter(), we might go and update inode->i_size to iocb->ki_pos, including for short writes. This looks inconsistent, because earlier we had destroyed all the blocks for the short write and yet we will be returning ret > 0 to the user saying that many bytes have been written. Again, I haven't yet found a test case, at least not in xfstests, which can trigger these short writes. Let me know your thoughts on this. All of this rests on the fact that there can be a case where written > 0 && written < length. I will read more to see whether this even happens or not, but I at least wanted to capture this somewhere.

static int ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
		ssize_t written, unsigned flags, struct iomap *iomap)
{
	if (iomap->type == IOMAP_MAPPED &&
	    written < length && (flags & IOMAP_WRITE))
		ext2_write_failed(inode->i_mapping, offset + length);
	return 0;
}

void ext2_write_failed(struct address_space *mapping, loff_t to)
{
	struct inode *inode = mapping->host;

	if (to > inode->i_size) {
		truncate_pagecache(inode, inode->i_size);
		ext2_truncate_blocks(inode, inode->i_size);
	}
}

ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
<...>
	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
		i_size_write(inode, iocb->ki_pos);
		mark_inode_dirty(inode);
	}
<...>

Another thing -
In DAX, while truncating the inode i_size in ext2_setsize(), I think we don't properly call dax_zero_blocks() when we are trying to zero the last block beyond EOF. E.g. it can be called with len as 0 if newsize is page aligned. It then will call ext2_get_blocks() with len = 0 and can hit a BUG_ON at maxblocks == 0. I think it should be this. I will spend some more time analyzing this and also test it once against DAX paths.
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 7ff669d0b6d2..cc264b1e288c 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) inode_dio_wait(inode); if (IS_DAX(inode)) - error = dax_zero_range(inode, newsize, - PAGE_ALIGN(newsize) - newsize, NULL, - &ext2_iomap_ops); + error = dax_truncate_page(inode, newsize, NULL, + &ext2_iomap_ops); else error = block_truncate_page(inode->i_mapping, newsize, ext2_get_block); -ritesh > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply related [flat|nested] 23+ messages in thread
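For illustration, a minimal sketch of where the -ENOTBLK hole check discussed above could live in ext2_iomap_begin(). This is not the posted patch, and it assumes ext2_get_blocks() keeps the signature it already has for the existing DAX iomap path:

static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
{
	unsigned int blkbits = inode->i_blkbits;
	unsigned long first_block = offset >> blkbits;
	unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
	bool new = false, boundary = false;
	u32 bno;
	int ret;

	/* create=1 only when asked for a write mapping */
	ret = ext2_get_blocks(inode, first_block, max_blocks, &bno, &new,
			      &boundary, flags & IOMAP_WRITE);
	if (ret < 0)
		return ret;

	iomap->flags = 0;
	iomap->offset = (u64)first_block << blkbits;
	iomap->bdev = inode->i_sb->s_bdev;

	if (ret == 0) {
		/*
		 * A hole: ext2 has no unwritten extents, so filling it via
		 * direct I/O could expose stale data to a racing buffered
		 * read or after a crash.  Return the magic error code so
		 * iomap_dio_rw() falls back to buffered I/O instead.
		 */
		if ((flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT))
			return -ENOTBLK;
		iomap->type = IOMAP_HOLE;
		iomap->addr = IOMAP_NULL_ADDR;
		iomap->length = 1 << blkbits;
	} else {
		iomap->type = IOMAP_MAPPED;
		iomap->addr = (u64)bno << blkbits;
		iomap->length = (u64)ret << blkbits;
		if (new)
			iomap->flags |= IOMAP_F_NEW;
	}
	return 0;
}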
* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap 2023-03-22 6:34 ` Ritesh Harjani @ 2023-03-23 11:30 ` Jan Kara 2023-03-23 13:19 ` Ritesh Harjani 2023-03-30 0:02 ` Christoph Hellwig 0 siblings, 2 replies; 23+ messages in thread From: Jan Kara @ 2023-03-23 11:30 UTC (permalink / raw) To: Ritesh Harjani; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc On Wed 22-03-23 12:04:01, Ritesh Harjani wrote: > Jan Kara <jack@suse.cz> writes: > >> + pos += size; > >> + if (pos > i_size_read(inode)) > >> + i_size_write(inode, pos); > >> + > >> + return 0; > >> +} > >> + > >> +static const struct iomap_dio_ops ext2_dio_write_ops = { > >> + .end_io = ext2_dio_write_end_io, > >> +}; > >> + > >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) > >> +{ > >> + struct file *file = iocb->ki_filp; > >> + struct inode *inode = file->f_mapping->host; > >> + ssize_t ret; > >> + unsigned int flags; > >> + unsigned long blocksize = inode->i_sb->s_blocksize; > >> + loff_t offset = iocb->ki_pos; > >> + loff_t count = iov_iter_count(from); > >> + > >> + > >> + inode_lock(inode); > >> + ret = generic_write_checks(iocb, from); > >> + if (ret <= 0) > >> + goto out_unlock; > >> + ret = file_remove_privs(file); > >> + if (ret) > >> + goto out_unlock; > >> + ret = file_update_time(file); > >> + if (ret) > >> + goto out_unlock; > >> + > >> + /* > >> + * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw() > >> + * calls for generic_write_sync in iomap_dio_complete(). > >> + * Since ext2_fsync nmust be called w/o inode lock, > >> + * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync() > >> + * ourselves. > >> + */ > >> + flags = IOMAP_DIO_NOSYNC; > > > > Meh, this is kind of ugly and we should come up with something better for > > simple filesystems so that they don't have to play these games. Frankly, > > these days I doubt there's anybody really needing inode_lock in > > __generic_file_fsync(). Neither sync_mapping_buffers() nor > > sync_inode_metadata() need inode_lock for their self-consistency. So it is > > only about flushing more consistent set of metadata to disk when fsync(2) > > races with other write(2)s to the same file so after a crash we have higher > > chances of seeing some real state of the file. But I'm not sure it's really > > worth keeping for filesystems that are still using sync_mapping_buffers(). > > People that care about consistency after a crash have IMHO moved to other > > filesystems long ago. > > > > One way which hch is suggesting is to use __iomap_dio_rw() -> unlock > inode -> call generic_write_sync(). I haven't yet worked on this part. So I see two problems with what Christoph suggests: a) It is unfortunate API design to require trivial (and low maintenance) filesystem to do these relatively complex locking games. But this can be solved by providing appropriate wrapper for them I guess. b) When you unlock the inode, other stuff can happen with the inode. And e.g. i_size update needs to happen after IO is completed so filesystems would have to be taught to avoid say two racing expanding writes. That's IMHO really too much to ask. > Are you suggesting to rip of inode_lock from __generic_file_fsync()? > Won't it have a much larger implications? Yes and yes :). But inode writeback already happens from other paths without inode_lock so there's hardly any surprise there. sync_mapping_buffers() is impossible to "customize" by filesystems and the generic code is fine without inode_lock. 
So I have hard time imagining how any filesystem would really depend on inode_lock in this path (famous last words ;)). > >> + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) || > >> + (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize))) > >> + flags |= IOMAP_DIO_FORCE_WAIT; > >> + > >> + ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops, > >> + flags, NULL, 0); > >> + > >> + if (ret == -ENOTBLK) > >> + ret = 0; > > > > So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of > > blockdev_direct_IO(). Thus you have to implement that in your > > ext2_iomap_ops, in particular in iomap_begin... > > > > Aah yes. Thanks for pointing that out - > ext2_iomap_begin() should have something like this - > /* > * We cannot fill holes in indirect tree based inodes as that could > * expose stale data in the case of a crash. Use the magic error code > * to fallback to buffered I/O. > */ > > Also I think ext2_iomap_end() should also handle a case like in ext4 - > > /* > * Check to see whether an error occurred while writing out the data to > * the allocated blocks. If so, return the magic error code so that we > * fallback to buffered I/O and attempt to complete the remainder of > * the I/O. Any blocks that may have been allocated in preparation for > * the direct I/O will be reused during buffered I/O. > */ > if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0) > return -ENOTBLK; > > > I am wondering if we have testcases in xfstests which really tests these > functionalities also or not? Let me give it a try... > ... So I did and somehow couldn't find any testcase which fails w/o > above changes. I guess we don't. It isn't that simple (but certainly possible) to test for stale data exposure... > Another query - > > We have this function ext2_iomap_end() (pasted below) > which calls ext2_write_failed(). > > Here IMO two cases are possible - > > 1. written is 0. which means an error has occurred. > In that case calling ext2_write_failed() make sense. > > 2. consider a case where written > 0 && written < length. > (This is possible right?). In that case we still go and call > ext2_write_failed(). This function will truncate the pagecache and disk > blocks beyong i_size. Now we haven't yet updated inode->i_size (we do > that in ->end_io which gets called in the end during completion) > So that means it just removes everything. > > Then in ext2_dax_write_iter(), we might go and update inode->i_size > to iocb->ki_pos including for short writes. This looks like it isn't > consistent because earlier we had destroyed all the blocks for the short > writes and we will be returning ret > 0 to the user saying these many > bytes have been written. > Again I haven't yet found a test case at least not in xfstests which > can trigger this short writes. Let me know your thoughts on this. > All of this lies on the fact that there can be a case where > written > 0 && written < length. I will read more to see if this even > happens or not. But I atleast wanted to capture this somewhere. So as far as I remember, direct IO writes as implemented in iomap are all-or-nothing (see iomap_dio_complete()). But it would be good to assert that in ext4 code to avoid surprises if the generic code changes. > Another thing - > In dax while truncating the inode i_size in ext2_setsize(), > I think we don't properly call dax_zero_blocks() when we are trying to > zero the last block beyond EOF. i.e. for e.g. it can be called with len > as 0 if newsize is page_aligned. 
It then will call ext2_get_blocks() with > len = 0 and can bug_on at maxblocks == 0. How will it call ext2_get_blocks() with len == 0? AFAICS iomap_iter() will not call iomap_begin() at all if iter.len == 0. > I think it should be this. I will spend some more time analyzing this > and also test it once against DAX paths. > > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 7ff669d0b6d2..cc264b1e288c 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) > inode_dio_wait(inode); > > if (IS_DAX(inode)) > - error = dax_zero_range(inode, newsize, > - PAGE_ALIGN(newsize) - newsize, NULL, > - &ext2_iomap_ops); > + error = dax_truncate_page(inode, newsize, NULL, > + &ext2_iomap_ops); > else > error = block_truncate_page(inode->i_mapping, > newsize, ext2_get_block); That being said this is indeed a nice cleanup. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
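The reason dax_truncate_page() sidesteps the len == 0 problem discussed above is that, like iomap_truncate_page(), it only calls the zeroing path when the new size is not already block aligned. A rough sketch of the idea (not necessarily the exact fs/dax.c code):

static int dax_truncate_page_sketch(struct inode *inode, loff_t pos,
		bool *did_zero, const struct iomap_ops *ops)
{
	unsigned int blocksize = i_blocksize(inode);
	unsigned int off = pos & (blocksize - 1);

	/* Nothing to zero if the new size lands on a block boundary. */
	if (!off)
		return 0;
	return dax_zero_range(inode, pos, blocksize - off, did_zero, ops);
}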
* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap 2023-03-23 11:30 ` Jan Kara @ 2023-03-23 13:19 ` Ritesh Harjani 2023-03-30 0:02 ` Christoph Hellwig 1 sibling, 0 replies; 23+ messages in thread From: Ritesh Harjani @ 2023-03-23 13:19 UTC (permalink / raw) To: Jan Kara; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc Jan Kara <jack@suse.cz> writes: Hi Jan, > On Wed 22-03-23 12:04:01, Ritesh Harjani wrote: >> Jan Kara <jack@suse.cz> writes: >> >> + pos += size; >> >> + if (pos > i_size_read(inode)) >> >> + i_size_write(inode, pos); >> >> + >> >> + return 0; >> >> +} >> >> + >> >> +static const struct iomap_dio_ops ext2_dio_write_ops = { >> >> + .end_io = ext2_dio_write_end_io, >> >> +}; >> >> + >> >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) >> >> +{ >> >> + struct file *file = iocb->ki_filp; >> >> + struct inode *inode = file->f_mapping->host; >> >> + ssize_t ret; >> >> + unsigned int flags; >> >> + unsigned long blocksize = inode->i_sb->s_blocksize; >> >> + loff_t offset = iocb->ki_pos; >> >> + loff_t count = iov_iter_count(from); >> >> + >> >> + >> >> + inode_lock(inode); >> >> + ret = generic_write_checks(iocb, from); >> >> + if (ret <= 0) >> >> + goto out_unlock; >> >> + ret = file_remove_privs(file); >> >> + if (ret) >> >> + goto out_unlock; >> >> + ret = file_update_time(file); >> >> + if (ret) >> >> + goto out_unlock; >> >> + >> >> + /* >> >> + * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw() >> >> + * calls for generic_write_sync in iomap_dio_complete(). >> >> + * Since ext2_fsync nmust be called w/o inode lock, >> >> + * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync() >> >> + * ourselves. >> >> + */ >> >> + flags = IOMAP_DIO_NOSYNC; >> > >> > Meh, this is kind of ugly and we should come up with something better for >> > simple filesystems so that they don't have to play these games. Frankly, >> > these days I doubt there's anybody really needing inode_lock in >> > __generic_file_fsync(). Neither sync_mapping_buffers() nor >> > sync_inode_metadata() need inode_lock for their self-consistency. So it is >> > only about flushing more consistent set of metadata to disk when fsync(2) >> > races with other write(2)s to the same file so after a crash we have higher >> > chances of seeing some real state of the file. But I'm not sure it's really >> > worth keeping for filesystems that are still using sync_mapping_buffers(). >> > People that care about consistency after a crash have IMHO moved to other >> > filesystems long ago. >> > >> >> One way which hch is suggesting is to use __iomap_dio_rw() -> unlock >> inode -> call generic_write_sync(). I haven't yet worked on this part. > > So I see two problems with what Christoph suggests: > > a) It is unfortunate API design to require trivial (and low maintenance) > filesystem to do these relatively complex locking games. But this can > be solved by providing appropriate wrapper for them I guess. > > b) When you unlock the inode, other stuff can happen with the inode. And > e.g. i_size update needs to happen after IO is completed so filesystems > would have to be taught to avoid say two racing expanding writes. That's > IMHO really too much to ask. > yes, that's the reason I was not touching it and I thought I will get back to it once I figure out other things. >> Are you suggesting to rip of inode_lock from __generic_file_fsync()? >> Won't it have a much larger implications? > > Yes and yes :). 
But inode writeback already happens from other paths > without inode_lock so there's hardly any surprise there. > sync_mapping_buffers() is impossible to "customize" by filesystems and the > generic code is fine without inode_lock. So I have hard time imagining how > any filesystem would really depend on inode_lock in this path (famous last > words ;)). > Ok sure. I will spend sometime looking into this code and history. And if everything looks good, will rip off inode_lock() from __generic_file_fsync(). >> >> + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) || >> >> + (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize))) >> >> + flags |= IOMAP_DIO_FORCE_WAIT; >> >> + >> >> + ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops, >> >> + flags, NULL, 0); >> >> + >> >> + if (ret == -ENOTBLK) >> >> + ret = 0; >> > >> > So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of >> > blockdev_direct_IO(). Thus you have to implement that in your >> > ext2_iomap_ops, in particular in iomap_begin... >> > >> >> Aah yes. Thanks for pointing that out - >> ext2_iomap_begin() should have something like this - >> /* >> * We cannot fill holes in indirect tree based inodes as that could >> * expose stale data in the case of a crash. Use the magic error code >> * to fallback to buffered I/O. >> */ >> >> Also I think ext2_iomap_end() should also handle a case like in ext4 - >> >> /* >> * Check to see whether an error occurred while writing out the data to >> * the allocated blocks. If so, return the magic error code so that we >> * fallback to buffered I/O and attempt to complete the remainder of >> * the I/O. Any blocks that may have been allocated in preparation for >> * the direct I/O will be reused during buffered I/O. >> */ >> if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0) >> return -ENOTBLK; >> >> >> I am wondering if we have testcases in xfstests which really tests these >> functionalities also or not? Let me give it a try... >> ... So I did and somehow couldn't find any testcase which fails w/o >> above changes. > > I guess we don't. It isn't that simple (but certainly possible) to test for > stale data exposure... > Yes, I am currently working on writing a aio dio test case to simulate this. AFAIU, the stale data exposure problems with non-extent filesystem is because it doesn't support unwritten extents. So in such case, if we submit an aio-dio we on a hole (inside file i_size), then a racing buffered read can race with it. That is, if the block allocation happens via aio-dio then the buffered read can read stale data from that block which is within inode i_size. This is not a problem for extending file writes because we anyway update file i_size after the write is done (and aio-dio is also synchronous against extending file writes but that's is to avoid race against other aio-dio). And this is not a problem for fileystems which supports unwritten extent because we can always allocate an unwritten extent inside a hole and then a racing buffered read cannot read stale data (because unwritten extents returns 0). Then unwritten to written conversion can happen at the end once the data is already written to these blocks. >> Another query - >> >> We have this function ext2_iomap_end() (pasted below) >> which calls ext2_write_failed(). >> >> Here IMO two cases are possible - >> >> 1. written is 0. which means an error has occurred. >> In that case calling ext2_write_failed() make sense. >> >> 2. 
consider a case where written > 0 && written < length. >> (This is possible right?). In that case we still go and call >> ext2_write_failed(). This function will truncate the pagecache and disk >> blocks beyong i_size. Now we haven't yet updated inode->i_size (we do >> that in ->end_io which gets called in the end during completion) >> So that means it just removes everything. >> >> Then in ext2_dax_write_iter(), we might go and update inode->i_size >> to iocb->ki_pos including for short writes. This looks like it isn't >> consistent because earlier we had destroyed all the blocks for the short >> writes and we will be returning ret > 0 to the user saying these many >> bytes have been written. >> Again I haven't yet found a test case at least not in xfstests which >> can trigger this short writes. Let me know your thoughts on this. >> All of this lies on the fact that there can be a case where >> written > 0 && written < length. I will read more to see if this even >> happens or not. But I atleast wanted to capture this somewhere. > > So as far as I remember, direct IO writes as implemented in iomap are > all-or-nothing (see iomap_dio_complete()). But it would be good to assert > that in ext4 code to avoid surprises if the generic code changes. > Ok, I was talking about error paths. But I think I will park this one to go through later. >> Another thing - >> In dax while truncating the inode i_size in ext2_setsize(), >> I think we don't properly call dax_zero_blocks() when we are trying to >> zero the last block beyond EOF. i.e. for e.g. it can be called with len >> as 0 if newsize is page_aligned. It then will call ext2_get_blocks() with >> len = 0 and can bug_on at maxblocks == 0. > > How will it call ext2_get_blocks() with len == 0? AFAICS iomap_iter() will > not call iomap_begin() at all if iter.len == 0. > It can. In dax_zero_range() if we pass len = 0, then iomap_iter() can get can get called with 0 len and it will call ->iomap_begin() with 0 len. Maybe a WARN_ON() would certainly help in iomap_iter() to check against 0 iter->len to be passed to ->iomap_begin() (Note iter->len and iter->iomap.length are different) >> I think it should be this. I will spend some more time analyzing this >> and also test it once against DAX paths. >> >> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c >> index 7ff669d0b6d2..cc264b1e288c 100644 >> --- a/fs/ext2/inode.c >> +++ b/fs/ext2/inode.c >> @@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) >> inode_dio_wait(inode); >> >> if (IS_DAX(inode)) >> - error = dax_zero_range(inode, newsize, >> - PAGE_ALIGN(newsize) - newsize, NULL, >> - &ext2_iomap_ops); >> + error = dax_truncate_page(inode, newsize, NULL, >> + &ext2_iomap_ops); >> else >> error = block_truncate_page(inode->i_mapping, >> newsize, ext2_get_block); > > That being said this is indeed a nice cleanup. Sure. I will add this patch in the beginning of the next revision. -ritesh ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap 2023-03-23 11:30 ` Jan Kara 2023-03-23 13:19 ` Ritesh Harjani @ 2023-03-30 0:02 ` Christoph Hellwig 1 sibling, 0 replies; 23+ messages in thread From: Christoph Hellwig @ 2023-03-30 0:02 UTC (permalink / raw) To: Jan Kara; +Cc: Ritesh Harjani, linux-fsdevel, linux-mm, lsf-pc On Thu, Mar 23, 2023 at 12:30:30PM +0100, Jan Kara wrote: > > > > One way which hch is suggesting is to use __iomap_dio_rw() -> unlock > > inode -> call generic_write_sync(). I haven't yet worked on this part. > > So I see two problems with what Christoph suggests: > > a) It is unfortunate API design to require trivial (and low maintenance) > filesystem to do these relatively complex locking games. But this can > be solved by providing appropriate wrapper for them I guess. Well, the problem is that these "trivial" file systems have a pretty broken locking scheme for fsync. The legacy dio code gets around this by unlocking i_rwsem inside of __blockdev_direct_IO, which forces a particular and somewhat odd locking scheme on the callers and severely limits what it can do. > > Are you suggesting to rip of inode_lock from __generic_file_fsync()? > > Won't it have a much larger implications? > > Yes and yes :). But inode writeback already happens from other paths > without inode_lock so there's hardly any surprise there. > sync_mapping_buffers() is impossible to "customize" by filesystems and the > generic code is fine without inode_lock. So I have hard time imagining how > any filesystem would really depend on inode_lock in this path (famous last > words ;)). Not holding the inode lock in ->fsync would solve all of this indeed. ^ permalink raw reply [flat|nested] 23+ messages in thread
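To make the direction concrete, here is a hypothetical fsync helper for sync_mapping_buffers()-based filesystems that never takes inode_lock, so an iomap-based ->write_iter would not need the IOMAP_DIO_NOSYNC workaround. This is a sketch of the idea only, not an existing kernel helper:

static int buffers_fsync_nolock(struct file *file, loff_t start, loff_t end,
				int datasync)
{
	struct inode *inode = file->f_mapping->host;
	int err, ret;

	/* Write back and wait for the data pages first. */
	err = file_write_and_wait_range(file, start, end);
	if (err)
		return err;

	/* Flush the metadata buffers associated with this mapping. */
	ret = sync_mapping_buffers(inode->i_mapping);
	if (!(inode->i_state & I_DIRTY_ALL))
		goto out;
	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
		goto out;

	err = sync_inode_metadata(inode, 1);
	if (ret == 0)
		ret = err;
out:
	/* Flush the device write cache so the data is really on stable media. */
	err = blkdev_issue_flush(inode->i_sb->s_bdev);
	if (ret == 0)
		ret = err;
	return ret;
}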
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-02-08 16:04 ` Jan Kara 2023-02-24 7:01 ` Zhang Yi 2023-02-26 20:16 ` Ritesh Harjani @ 2023-02-27 19:26 ` Darrick J. Wong 2023-02-27 21:02 ` Matthew Wilcox 2 siblings, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2023-02-27 19:26 UTC (permalink / raw) To: Jan Kara Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, Christoph Hellwig, David Howells, kbus @imap.suse.de>> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm On Wed, Feb 08, 2023 at 05:04:22PM +0100, Jan Kara wrote: > On Sun 29-01-23 05:06:47, Matthew Wilcox wrote: > > On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote: > > > I'm hoping this *might* be useful to some, but I fear it may leave quite > > > a bit of folks with more questions than answers as it did for me. And > > > hence I figured that *this aspect of this topic* perhaps might be a good > > > topic for LSF. The end goal would hopefully then be finally enabling us > > > to document IOMAP API properly and helping with the whole conversion > > > effort. > > > > +1 from me. > > > > I've made a couple of abortive efforts to try and convert a "trivial" > > filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on > > what the semantics are for get_block_t and iomap_begin(). > > Yeah, I'd be also interested in this discussion. In particular as a > maintainer of part of these legacy filesystems (ext2, udf, isofs). > > > > Perhaps fs/buffers.c could be converted to folios only, and be done > > > with it. But would we be loosing out on something? What would that be? > > > > buffer_heads are inefficient for multi-page folios because some of the > > algorthims are O(n^2) for n being the number of buffers in a folio. > > It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in > > a 2MB folio, it's pretty sticky. Things like "Read I/O has completed on > > this buffer, can I mark the folio as Uptodate now?" For iomap, that's a > > scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512 > > allocations, looking at one bit in each BH before moving on to the next. > > Similarly for writeback, iirc. > > > > So +1 from me for a "How do we convert 35-ish block based filesystems > > from BHs to iomap for their buffered & direct IO paths". There's maybe a > > separate discussion to be had for "What should the API be for filesystems > > to access metadata on the block device" because I don't believe the > > page-cache based APIs are easy for fs authors to use. > > Yeah, so the actual data paths should be relatively easy for these old > filesystems as they usually don't do anything special (those that do - like > reiserfs - are deprecated and to be removed). But for metadata we do need > some convenience functions like - give me block of metadata at this block > number, make it dirty / clean / uptodate (block granularity dirtying & > uptodate state is absolute must for metadata, otherwise we'll have data > corruption issues). From the more complex functionality we need stuff like: > lock particular block of metadata (equivalent of buffer lock), track that > this block is metadata for given inode so that it can be written on > fsync(2). Then more fancy filesystems like ext4 also need to attach more > private state to each metadata block but that needs to be dealt with on > case-by-case basis anyway. I reiterate a years-ago suggestion from Dave Chinner to reintroduce a (metadata-only) buffer cache for filesystems. 
xfs_buf already does everything you want -- read a buffer, mark it dirty, bwrite it to storage, lock it, unlock it, mark it stale, etc. Memory pages are tracked (and reclaimed) separately from the bdev page cache, which means filesystem authors don't need to figure out the pagecache APIs. Upside/downside: Userspace programs scribbling on a mounted block device lose some ability to screw over that mounted filesystem. tune2fs loses the ability to change ext4 labels. Things get harder when you want to feed buffers to jbd2, since IIRC jbd2 tracks its own state in the journal head. Some the pieces are similar to the xfs_buf_log_item (shadow buffers, list of items in transaction), but others (triggers) aren't. Given that jbd2 and ext4 both want to attach other bits of information, I suspect you'd have to let the filesystem allocate the struct fsbuf objects. OTOH, once you do this, I think it's shouldn't be hard to remove buffer heads from ext* except the crypt/verity files. > > Maybe some related topics are > > "What testing should we require for some of these ancient filesystems?" > > "Whose job is it to convert these 35 filesystems anyway, can we just > > delete some of them?" > > I would not certainly miss some more filesystems - like minix, sysv, ... > But before really treatening to remove some of these ancient and long > untouched filesystems, we should convert at least those we do care about. > When there's precedent how simple filesystem conversion looks like, it is > easier to argue about what to do with the ones we don't care about so much. AFAICT, most of those old filesystems are pretty simple -- they don't do the fancy cow and transformation things that modern filesystems do. My guess is that rewiring the buffered IO path wouldn't be that hard. Dealing with all the other bitrot (metadata buffer heads, lack of testing) is what's going to kill them or force them into becoming fuse servers that we can isolate to userspace. Maybe that isn't such a bad thing. > > "Is there a lower-performance but easier-to-implement API than iomap > > for old filesystems that only exist for compatibiity reasons?" > > As I wrote above, for metadata there ought to be something as otherwise it > will be real pain (and no gain really). But I guess the concrete API only > matterializes once we attempt a conversion of some filesystem like ext2. > I'll try to have a look into that, at least the obvious preparatory steps > like converting the data paths to iomap. willy's tried this. --D > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
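To give that suggestion a concrete shape, a purely hypothetical sketch of what a minimal filesystem metadata buffer API along those lines might look like; none of these symbols exist today, and xfs_buf is the obvious model:

struct fsbuf;	/* one block (or run of blocks) of metadata, kept out of the bdev pagecache */

/* Read (or find cached) metadata at blkno and return it locked. */
struct fsbuf *fsbuf_get(struct block_device *bdev, sector_t blkno,
			size_t length, gfp_t gfp);

void fsbuf_lock(struct fsbuf *bp);
void fsbuf_unlock(struct fsbuf *bp);

/* Block-granularity dirty/uptodate state, as required for metadata. */
void fsbuf_mark_dirty(struct fsbuf *bp);
void fsbuf_mark_uptodate(struct fsbuf *bp);

/* Write synchronously, or mark stale so it is never written back. */
int fsbuf_write(struct fsbuf *bp);
void fsbuf_stale(struct fsbuf *bp);

/* Track that this buffer backs metadata of @inode so fsync can find it. */
void fsbuf_attach_inode(struct fsbuf *bp, struct inode *inode);

void fsbuf_put(struct fsbuf *bp);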
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-02-27 19:26 ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong @ 2023-02-27 21:02 ` Matthew Wilcox 0 siblings, 0 replies; 23+ messages in thread From: Matthew Wilcox @ 2023-02-27 21:02 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, Luis Chamberlain, lsf-pc, Christoph Hellwig, David Howells, kbus @imap.suse.de>> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm On Mon, Feb 27, 2023 at 11:26:17AM -0800, Darrick J. Wong wrote: > > As I wrote above, for metadata there ought to be something as otherwise it > > will be real pain (and no gain really). But I guess the concrete API only > > matterializes once we attempt a conversion of some filesystem like ext2. > > I'll try to have a look into that, at least the obvious preparatory steps > > like converting the data paths to iomap. > > willy's tried this. Well, I started on it. The first thing I looked at was trying to decide how to handle the equivalent of BH_boundary: https://lore.kernel.org/linux-fsdevel/20200320144014.3276-1-willy@infradead.org/ I'm no longer sure that's the right approach. Maybe we do want to have an IOMAP_F_BOUNDARY to indicate that the bio should be submitted before calling iomap_iter() again. But we're definitely still missing the piece where we start the I/O to the page by setting iop->read_bytes_pending to folio_size() and then subtract off each piece as we either zero it or the I/O completes. I have a bunch of other pagecache / fs work queued up for this cycle, so I wasn't planning on leaping back into this. But it's worth making sure that people know about one of the problems we figured out three years ago: https://lore.kernel.org/linux-fsdevel/20200505190825.GB5694@magnolia/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-01-29 5:06 ` Matthew Wilcox 2023-01-29 5:39 ` Luis Chamberlain 2023-02-08 16:04 ` Jan Kara @ 2023-02-27 19:47 ` Darrick J. Wong 2023-02-27 20:24 ` Luis Chamberlain 2 siblings, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2023-02-27 19:47 UTC (permalink / raw) To: Matthew Wilcox Cc: Luis Chamberlain, lsf-pc, Christoph Hellwig, David Howells, kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm On Sun, Jan 29, 2023 at 05:06:47AM +0000, Matthew Wilcox wrote: > On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote: > > I'm hoping this *might* be useful to some, but I fear it may leave quite > > a bit of folks with more questions than answers as it did for me. And > > hence I figured that *this aspect of this topic* perhaps might be a good > > topic for LSF. The end goal would hopefully then be finally enabling us > > to document IOMAP API properly and helping with the whole conversion > > effort. > > +1 from me. > > I've made a couple of abortive efforts to try and convert a "trivial" > filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on > what the semantics are for get_block_t and iomap_begin(). Yup. I wrote about it a little bit here: https://lore.kernel.org/linux-fsdevel/Y%2Fz%2FJrV8qRhUcqE7@magnolia/T/#mda6c3175857d1e4cba88dca042fee030207df4f6 ...and promised that I'd get back to writeback. For buffered IO, iomap does things in a much different order than (I think) most filesystems. Traditionally, I think the order is i_rwsem -> mmap_invalidatelock(?) -> page lock -> get mapping. iomap relies on the callers to take i_rwsem, asks the filesystem for a mapping (with whatever locking that entails), and only then starts locking pagecache folios to operate on them. IOWs, involving the filesystem earlier in the process enables it to make better decisions about space allocations, which in turn should make things faster and less fragmenty. OTOH, it also means that we've learned the hard way that pagecache operations need a means to revalidate mappings to avoid write races. This applies both to the initial pagecache write and to scheduling writeback, but the mechanisms for each were developed separately and years apart. See iomap::validity_cookie and xfs_writepage_ctx::{data,cow}_seq for what I'm talking about. We (xfs developers) ought to figure out if these two mechanisms should be merged before more filesystems start using iomap for buffered io. I'd like to have a discussion about how to clean up and clarify the iomap interfaces, and a separate one about how to port the remaining 35+ filesystems. I don't know how exactly to split this into LSF sessions, other than to suggest at least two. If hch or dchinner show up, I also want to drag them into this. :) --D > > Perhaps fs/buffers.c could be converted to folios only, and be done > > with it. But would we be loosing out on something? What would that be? > > buffer_heads are inefficient for multi-page folios because some of the > algorthims are O(n^2) for n being the number of buffers in a folio. > It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in > a 2MB folio, it's pretty sticky. Things like "Read I/O has completed on > this buffer, can I mark the folio as Uptodate now?" For iomap, that's a > scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512 > allocations, looking at one bit in each BH before moving on to the next. > Similarly for writeback, iirc. 
> > So +1 from me for a "How do we convert 35-ish block based filesystems > from BHs to iomap for their buffered & direct IO paths". There's maybe a > separate discussion to be had for "What should the API be for filesystems > to access metadata on the block device" because I don't believe the > page-cache based APIs are easy for fs authors to use. > > Maybe some related topics are > "What testing should we require for some of these ancient filesystems?" > "Whose job is it to convert these 35 filesystems anyway, can we just > delete some of them?" > "Is there a lower-performance but easier-to-implement API than iomap > for old filesystems that only exist for compatibiity reasons?" > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-02-27 19:47 ` Darrick J. Wong @ 2023-02-27 20:24 ` Luis Chamberlain 0 siblings, 0 replies; 23+ messages in thread From: Luis Chamberlain @ 2023-02-27 20:24 UTC (permalink / raw) To: Darrick J. Wong Cc: Matthew Wilcox, lsf-pc, Christoph Hellwig, David Howells, kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm On Mon, Feb 27, 2023 at 11:47:24AM -0800, Darrick J. Wong wrote: > OTOH, it also means that we've learned the hard way that pagecache > operations need a means to revalidate mappings to avoid write races. > This applies both to the initial pagecache write and to scheduling > writeback, but the mechanisms for each were developed separately and > years apart. See iomap::validity_cookie and > xfs_writepage_ctx::{data,cow}_seq for what I'm talking about. > We (xfs developers) ought to figure out if these two mechanisms should > be merged before more filesystems start using iomap for buffered io. That puts a good yield sign in front of some conversion efforts, thanks; this alone is already very useful. > I'd like to have a discussion about how to clean up and clarify the > iomap interfaces, and a separate one about how to port the remaining 35+ > filesystems. I don't know how exactly to split this into LSF sessions, > other than to suggest at least two. From a conversion perspective, ideally, if it were obvious, I think we should be able to do some of it a la Coccinelle, but I have yet to see any remotely obvious pattern. And it makes me wonder: should we strive to make the conversion as painless and obvious as possible? Is that a good litmus test for when we should be ready to start converting other filesystems? > If hch or dchinner show up, I also want to drag them into this. :) And here I've been thinking I had to go to Australia to see you all together. Luis ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-01-29 4:46 LSF/MM/BPF 2023 IOMAP conversion status update Luis Chamberlain 2023-01-29 5:06 ` Matthew Wilcox @ 2023-02-27 19:06 ` Darrick J. Wong 2023-02-27 19:58 ` Luis Chamberlain 2023-03-01 16:59 ` Ritesh Harjani 2 siblings, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2023-02-27 19:06 UTC (permalink / raw) To: Luis Chamberlain Cc: lsf-pc, Christoph Hellwig, Matthew Wilcox, David Howells, kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote: > One of the recurring themes that comes up at LSF is "iomap has little > to no documentation, it is hard to use". I've only recently taken a > little nose dive into it, and so I also can frankly admit to say I don't > grok it well either yet. However, the *general* motivation and value is clear: > avoiding the old ugly monster of struct buffer_head, and abstracting > the page cache for non network filesystems, and that is because for > network filesystems my understanding is that we have another side effort > for that. We could go a bit down memory lane on prior attempts to kill > the struct buffer_head evil demon from Linux, or why its evil, but I'm not > sure if recapping that is useful at this point in time, let me know, I could > do that if it helps if folks want to talk about this at LSF. I think there's so much to talk about WRT iomap that I also think we're not going to have time for recapping why buffer heads are awful. > For now I rather > instead focus on sharing efforts to review where we are today on the effort > towards conversion towards IOMAP for some of the major filesystems: > > https://docs.google.com/presentation/d/e/2PACX-1vSN4TmhiTu1c6HNv6_gJZFqbFZpbF7GkABllSwJw5iLnSYKkkO-etQJ3AySYEbgJA/pub?start=true&loop=false&delayms=3000&slide=id.g189cfd05063_0_225 Ooh, slides, excellent! Slide 3 -- there are a lot of separate struct iomap_ops in XFS because we learned (somewhat late) that creating a separate ops structure (and functions) for each IO path / use case is much more sanity-preserving than multiplexing everything into one gigantic function. Now we have four different write codepaths in the kernel: extern const struct iomap_ops xfs_buffered_write_iomap_ops; extern const struct iomap_ops xfs_page_mkwrite_iomap_ops; These two use delayed allocation and use the pagecache for COW. extern const struct iomap_ops xfs_direct_write_iomap_ops; extern const struct iomap_ops xfs_dax_write_iomap_ops; These two don't use delayed allocation. IIRC the DAX ops also take care of pre-zeroing allocations since we assume that memcpy to pmem is relatively "cheap". Only one is needed to cover the read cases and FIEMAP: extern const struct iomap_ops xfs_read_iomap_ops; And then there's the weird ones for SEEK_{DATA,HOLE}: extern const struct iomap_ops xfs_seek_iomap_ops; And FIEMAP for xattrs: extern const struct iomap_ops xfs_xattr_iomap_ops; (I'll get to xfs_writeback_ops downthread.) > I'm hoping this *might* be useful to some, but I fear it may leave quite > a bit of folks with more questions than answers as it did for me. And > hence I figured that *this aspect of this topic* perhaps might be a good > topic for LSF. The end goal would hopefully then be finally enabling us > to document IOMAP API properly and helping with the whole conversion > effort. Heh. Yes. Absolutely yes. 
struct iomap_ops is a giant ball of frustration: int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length, unsigned flags, struct iomap *iomap, struct iomap *srcmap); If you're a filesystem trying to use iomap, it's not at all clear when you're supposed to fill out @iomap and when you need to fill out @srcmap. Right now, the rule is that an ->iomap_begin implementation should always fill out @iomap. If the operation is a copy-on-write, only then does the ->iomap_begin function need to fill out @srcmap. I've been mulling over redesigning this so that the two parameters are @read_iomap and @write_iomap. Any time there's a place from which to read data, the ->iomap_begin function would fill out @read_iomap. This would also be the vessel for SEEK_*, FIEMAP, and swapfile activation. Any time there's a place in which to write data, the ->iomap_begin function would fill out @write_iomap. For a COW, the read and write mappings would be different. For a pure overwrite, they'd be the same. This is going to take a bit of thought to get right. It'll be easier for XFS because we split the iomap_ops to handle different responsibilities, but I haven't looked all that far into ext4 and btrfs to see what they do. I also want to convert the inode/pos/length/flags arguments to a const pointer to struct iomap_iter to reduce the register count. > My gatherings from this quick review of API evolution and use is that, > XFS is *certainly* a first class citizen user. No surprise there if a > lot of the effort came out from XFS. And even though btrfs now avoids > the evil struct buffer_head monster, its use of the IOMAP API seems > *dramatically* different than XFS, and it probably puzzles many. Is it > that btrfs managed to just get rid of struct buffer_head use but missed > fully abstracting working with the page cache? How does one check? What > semantics do we look for? I'm under the impression that for buffered io, btrfs manages pagecache folios directly, without the use of iomap at all. Hence no mention of buffer heads in the codebase. A big problem I see in porting btrfs/ext*/f2fs/etc to use the iomap pagecache code is that iomap doesn't support io permutations at all. It assumes that it can assemble struct bio with pagecache pages and issue that bio directly to the device. IOWs, someone will need to port fscrypt, fsverity, compression, etc. to iomap before those filesystems can jump off bufferheads. > When looking to see if one can help on the conversion front with other > filesystems it begs the question what is the correct real end goal. What > should one strive for? And it also gets me wondering, if we wanted to abstract > the page cache from scratch again, would we have done this a bit differently > now? Are there lessons from the network filesystem side of things which > can be shared? If so it gets me wondering if this instead should be > about why that's a good idea and what should that look like. All good questions to ask. > Perhaps fs/buffers.c could be converted to folios only, and be done > with it. But would we be loosing out on something? What would that be? Dirty secret here: I still don't understand what buffers.c does. ;) --D > Luis ^ permalink raw reply [flat|nested] 23+ messages in thread
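To illustrate the current @iomap/@srcmap rule described above for a hypothetical filesystem "foo" (the foo_* types and helpers are invented for this sketch): @iomap always describes where the data goes, and @srcmap is filled out only when the write is a copy-on-write and the old data has to be read from somewhere else:

struct foo_extent {
	loff_t	offset;		/* file offset of the mapping */
	u64	len;		/* length in bytes */
	u64	daddr;		/* disk address data is written to / read from */
	u64	cow_src_daddr;	/* where the old data lives, if this is a COW */
	bool	is_cow;
};

int foo_find_or_alloc_extent(struct inode *inode, loff_t pos, loff_t length,
			     unsigned flags, struct foo_extent *ext);

static int foo_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
			   unsigned flags, struct iomap *iomap,
			   struct iomap *srcmap)
{
	struct foo_extent ext;
	int error;

	error = foo_find_or_alloc_extent(inode, pos, length, flags, &ext);
	if (error)
		return error;

	/* Always fill @iomap: where reads come from and writes go to. */
	iomap->type = IOMAP_MAPPED;
	iomap->addr = ext.daddr;
	iomap->offset = ext.offset;
	iomap->length = ext.len;
	iomap->bdev = inode->i_sb->s_bdev;

	/*
	 * Only a copy-on-write needs @srcmap: it names the blocks the
	 * existing data must be read from, while @iomap names the new
	 * blocks being written.  For a plain overwrite, leave @srcmap
	 * untouched (it stays a hole) and @iomap serves both purposes.
	 */
	if ((flags & IOMAP_WRITE) && ext.is_cow) {
		srcmap->type = IOMAP_MAPPED;
		srcmap->addr = ext.cow_src_daddr;
		srcmap->offset = ext.offset;
		srcmap->length = ext.len;
		srcmap->bdev = inode->i_sb->s_bdev;
	}
	return 0;
}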
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-02-27 19:06 ` Darrick J. Wong @ 2023-02-27 19:58 ` Luis Chamberlain 0 siblings, 0 replies; 23+ messages in thread From: Luis Chamberlain @ 2023-02-27 19:58 UTC (permalink / raw) To: Darrick J. Wong Cc: lsf-pc, Christoph Hellwig, Matthew Wilcox, David Howells, kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm On Mon, Feb 27, 2023 at 11:06:14AM -0800, Darrick J. Wong wrote: > On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote: > > One of the recurring themes that comes up at LSF is "iomap has little > > to no documentation, it is hard to use". I've only recently taken a > > little nose dive into it, and so I also can frankly admit to say I don't > > grok it well either yet. However, the *general* motivation and value is clear: > > avoiding the old ugly monster of struct buffer_head, and abstracting > > the page cache for non network filesystems, and that is because for > > network filesystems my understanding is that we have another side effort > > for that. We could go a bit down memory lane on prior attempts to kill > > the struct buffer_head evil demon from Linux, or why its evil, but I'm not > > sure if recapping that is useful at this point in time, let me know, I could > > do that if it helps if folks want to talk about this at LSF. > > I think there's so much to talk about WRT iomap that I also think we're > not going to have time for recapping why buffer heads are awful. Does this mean you'll finally be attending a conference (if invited)?! I have some private write ups on this as I reviewed *dozens* of threads on prior efforts to address this issue before, so I could try to condense some of this as a preamble write up before this session, should this session be a candidate discussion. > > For now I rather > > instead focus on sharing efforts to review where we are today on the effort > > towards conversion towards IOMAP for some of the major filesystems: > > > > https://docs.google.com/presentation/d/e/2PACX-1vSN4TmhiTu1c6HNv6_gJZFqbFZpbF7GkABllSwJw5iLnSYKkkO-etQJ3AySYEbgJA/pub?start=true&loop=false&delayms=3000&slide=id.g189cfd05063_0_225 > > Ooh, slides, excellent! > > Slide 3 -- there are a lot of separate struct iomap_ops in XFS because > we learned (somewhat late) that creating a separate ops structure (and > functions) for each IO path / use case is much more sanity-preserving > than multiplexing everything into one gigantic function. This is of huge value. It is interesting that this was found through experience, i.e. empirically, rather than early on, and it made me wonder: if things could be re-done, what would we do? If that was useful for XFS, could this also be useful for other filesystems? Could encoding this somehow in the IOMAP API be useful? Based on your experience, *should* filesystem authors *strive* towards splitting up the paths as well for IOMAP use? > Now we have four different write codepaths in the kernel: > > extern const struct iomap_ops xfs_buffered_write_iomap_ops; > extern const struct iomap_ops xfs_page_mkwrite_iomap_ops; > > These two use delayed allocation and use the pagecache for COW. > > extern const struct iomap_ops xfs_direct_write_iomap_ops; > extern const struct iomap_ops xfs_dax_write_iomap_ops; > > These two don't use delayed allocation. IIRC the DAX ops also take care > of pre-zeroing allocations since we assume that memcpy to pmem is > relatively "cheap".
> > Only one is needed to cover the read cases and FIEMAP: > > extern const struct iomap_ops xfs_read_iomap_ops; > > And then there's the weird ones for SEEK_{DATA,HOLE}: > > extern const struct iomap_ops xfs_seek_iomap_ops; > > And FIEMAP for xattrs: > > extern const struct iomap_ops xfs_xattr_iomap_ops; > > (I'll get to xfs_writeback_ops downthread.) > > > I'm hoping this *might* be useful to some, but I fear it may leave quite > > a bit of folks with more questions than answers as it did for me. And > > hence I figured that *this aspect of this topic* perhaps might be a good > > topic for LSF. The end goal would hopefully then be finally enabling us > > to document IOMAP API properly and helping with the whole conversion > > effort. > > Heh. Yes. Absolutely yes. > > struct iomap_ops is a giant ball of frustration: > > int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length, > unsigned flags, struct iomap *iomap, > struct iomap *srcmap); > > If you're a filesystem trying to use iomap, it's not at all clear when > you're supposed to fill out @iomap and when you need to fill out > @srcmap. Right now, the rule is that an ->iomap_begin implementation > should always fill out @iomap. If the operation is a copy-on-write, > only then does the ->iomap_begin function need to fill out @srcmap. > > I've been mulling over redesigning this so that the two parameters are > @read_iomap and @write_iomap. Any time there's a place from which to > read data, the ->iomap_begin function would fill out @read_iomap. This > would also be the vessel for SEEK_*, FIEMAP, and swapfile activation. > > Any time there's a place in which to write data, the ->iomap_begin > function would fill out @write_iomap. For a COW, the read and write > mappings would be different. For a pure overwrite, they'd be the same. > > This is going to take a bit of thought to get right. It'll be easier > for XFS because we split the iomap_ops to handle different > responsibilities, but I haven't looked all that far into ext4 and btrfs > to see what they do. IMHO so long as those use-cases don't break we should be fine in re-thinking IOMAP now. Otherwise a change later would incur much more care and could regress. A few of us are the brink of helping with the conversion effort, but these questions seems a bit more important to address than just trying to reduce the struct buffer_head count on each filesystem. > I also want to convert the inode/pos/length/flags arguments to a const > pointer to struct iomap_iter to reduce the register count. Neat. > > My gatherings from this quick review of API evolution and use is that, > > XFS is *certainly* a first class citizen user. No surprise there if a > > lot of the effort came out from XFS. And even though btrfs now avoids > > the evil struct buffer_head monster, its use of the IOMAP API seems > > *dramatically* different than XFS, and it probably puzzles many. Is it > > that btrfs managed to just get rid of struct buffer_head use but missed > > fully abstracting working with the page cache? How does one check? What > > semantics do we look for? > > I'm under the impression that for buffered io, btrfs manages pagecache > folios directly, without the use of iomap at all. Hence no mention of > buffer heads in the codebase. > > A big problem I see in porting btrfs/ext*/f2fs/etc to use the iomap > pagecache code is that iomap doesn't support io permutations at all. > It assumes that it can assemble struct bio with pagecache pages and > issue that bio directly to the device. 
> > IOWs, someone will need to port fscrypt, fsverity, compression, etc. to > iomap before those filesystems can jump off bufferheads. Is it a worthy endeavor to have filesystems use IOMAP to manage working with the pagecache for you? Even if hard, is that a worthy lofty goal to aim for? Do we have general consensus on this? > > When looking to see if one can help on the conversion front with other > > filesystems it begs the question what is the correct real end goal. What > > should one strive for? And it also gets me wondering, if we wanted to abstract > > the page cache from scratch again, would we have done this a bit differently > > now? Are there lessons from the network filesystem side of things which > > can be shared? If so it gets me wondering if this instead should be > > about why that's a good idea and what should that look like. > > All good questions to ask. > > > Perhaps fs/buffers.c could be converted to folios only, and be done > > with it. But would we be loosing out on something? What would that be? > > Dirty secret here: I still don't understand what buffers.c does. ;) I'm on the same boat :D Luis ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-01-29 4:46 LSF/MM/BPF 2023 IOMAP conversion status update Luis Chamberlain 2023-01-29 5:06 ` Matthew Wilcox 2023-02-27 19:06 ` Darrick J. Wong @ 2023-03-01 16:59 ` Ritesh Harjani 2023-03-01 17:08 ` Darrick J. Wong 2 siblings, 1 reply; 23+ messages in thread From: Ritesh Harjani @ 2023-03-01 16:59 UTC (permalink / raw) To: Luis Chamberlain, lsf-pc, Christoph Hellwig, Matthew Wilcox, David Howells Cc: Luis Chamberlain, kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm Luis Chamberlain <mcgrof@kernel.org> writes: > One of the recurring themes that comes up at LSF is "iomap has little > to no documentation, it is hard to use". I've only recently taken a > little nose dive into it, and so I also can frankly admit to say I don't > grok it well either yet. However, the *general* motivation and value is clear: > avoiding the old ugly monster of struct buffer_head, and abstracting > the page cache for non network filesystems, and that is because for > network filesystems my understanding is that we have another side effort > for that. We could go a bit down memory lane on prior attempts to kill > the struct buffer_head evil demon from Linux, or why its evil, but I'm not > sure if recapping that is useful at this point in time, let me know, I could > do that if it helps if folks want to talk about this at LSF. For now I rather It would certainly help to hear on what are our plans of IOMAP_F_BUFFER_HEAD flag and it's related code. I know it is there for gfs2, but it would be good to know on what are our plans before we start converting all other filesystems to move to iomap? Do we advise on not to use this path for other filesystems? Do we plan to deprecate it in order to kill buffer heads in future? e.g. root> git grep "buffer_head" fs/iomap/ fs/iomap/buffered-io.c:#include <linux/buffer_head.h> Wanted more insights on this and our plans w.r.t other filesystem wanting to use it. So a short walk down the memory lane and our plans for future w.r.t IOMAP_F_BUFFER_HEAD would certainly help. Thanks -ritesh ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: LSF/MM/BPF 2023 IOMAP conversion status update 2023-03-01 16:59 ` Ritesh Harjani @ 2023-03-01 17:08 ` Darrick J. Wong 0 siblings, 0 replies; 23+ messages in thread From: Darrick J. Wong @ 2023-03-01 17:08 UTC (permalink / raw) To: Ritesh Harjani Cc: Luis Chamberlain, lsf-pc, Christoph Hellwig, Matthew Wilcox, David Howells, kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel, linux-mm On Wed, Mar 01, 2023 at 10:29:56PM +0530, Ritesh Harjani wrote: > Luis Chamberlain <mcgrof@kernel.org> writes: > > > One of the recurring themes that comes up at LSF is "iomap has little > > to no documentation, it is hard to use". I've only recently taken a > > little nose dive into it, and so I also can frankly admit to say I don't > > grok it well either yet. However, the *general* motivation and value is clear: > > avoiding the old ugly monster of struct buffer_head, and abstracting > > the page cache for non network filesystems, and that is because for > > network filesystems my understanding is that we have another side effort > > for that. We could go a bit down memory lane on prior attempts to kill > > the struct buffer_head evil demon from Linux, or why its evil, but I'm not > > sure if recapping that is useful at this point in time, let me know, I could > > do that if it helps if folks want to talk about this at LSF. For now I rather > > It would certainly help to hear on what are our plans of > IOMAP_F_BUFFER_HEAD flag and it's related code. I know it is there > for gfs2, but it would be good to know on what are our plans before we > start converting all other filesystems to move to iomap? > Do we advise on not to use this path for other filesystems? Do we plan > to deprecate it in order to kill buffer heads in future? > e.g. > root> git grep "buffer_head" fs/iomap/ > fs/iomap/buffered-io.c:#include <linux/buffer_head.h> > > Wanted more insights on this and our plans w.r.t other filesystem > wanting to use it. So a short walk down the memory lane and our plans > for future w.r.t IOMAP_F_BUFFER_HEAD would certainly help. For filesystems considering an iomap port, my advice is: If your filesystem is simple (e.g. direct overwrites, no cow, no verity, no fscrypt, etc) then you ought to consider jumping to iomap directly and moving off bufferheads forever. If I were working on a port of something simple(ish) like ext2 or fat or something, that's how I'd go. Obviously, any filesystem that does not use bufferheads and ports to iomap should go straight there. Do not *start* using bufferheads. For filesystems where things are more complicated (ext4/jbd2) it might make more sense to port to iomap with bufferheads in one release to make sure you've gotten the locking, iomap-ops, and writeback working correctly. Once that's done, then move off bufferheads. gfs2 might want to move off of bufferheads some day, but I think they're still letting the dust settle on the new iomap plumbing. IOWs, I don't see IOMAP_F_BUFFER_HEAD going away until there's no longer any interest in it. --D > Thanks > -ritesh ^ permalink raw reply [flat|nested] 23+ messages in thread
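For reference, opting in to the transitional bufferhead path is done per mapping: the filesystem tags the iomap it returns from ->iomap_begin, and fs/iomap/buffered-io.c then routes the buffered write paths for that range through the legacy buffer_head helpers instead of iomap's native per-folio state. A sketch with made-up foo_* helpers:

static int foo_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
			   unsigned flags, struct iomap *iomap,
			   struct iomap *srcmap)
{
	/* foo_map_blocks() is hypothetical: it fills out @iomap for this range. */
	int error = foo_map_blocks(inode, pos, length, flags, iomap);

	if (error)
		return error;

	/*
	 * Keep using buffer_heads for the pagecache state of this mapping
	 * (what gfs2 does while migrating); drop the flag once the
	 * filesystem is ready for the native iomap buffered-io path.
	 */
	if (foo_still_needs_bufferheads(inode))		/* hypothetical predicate */
		iomap->flags |= IOMAP_F_BUFFER_HEAD;
	return 0;
}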
end of thread, other threads: [~2023-03-30 0:02 UTC | newest]

Thread overview: 23+ messages
2023-01-29  4:46 LSF/MM/BPF 2023 IOMAP conversion status update Luis Chamberlain
2023-01-29  5:06 ` Matthew Wilcox
2023-01-29  5:39 ` Luis Chamberlain
2023-02-08 16:04 ` Jan Kara
2023-02-24  7:01 ` Zhang Yi
2023-02-26 20:16 ` Ritesh Harjani
2023-03-16 14:40 ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM)
2023-03-16 15:41 ` Darrick J. Wong
2023-03-20 16:11 ` Ritesh Harjani
2023-03-20 13:15 ` Christoph Hellwig
2023-03-20 17:51 ` Jan Kara
2023-03-22  6:34 ` Ritesh Harjani
2023-03-23 11:30 ` Jan Kara
2023-03-23 13:19 ` Ritesh Harjani
2023-03-30  0:02 ` Christoph Hellwig
2023-02-27 19:26 ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong
2023-02-27 21:02 ` Matthew Wilcox
2023-02-27 19:47 ` Darrick J. Wong
2023-02-27 20:24 ` Luis Chamberlain
2023-02-27 19:06 ` Darrick J. Wong
2023-02-27 19:58 ` Luis Chamberlain
2023-03-01 16:59 ` Ritesh Harjani
2023-03-01 17:08 ` Darrick J. Wong