linux-mm.kvack.org archive mirror
* LSF/MM/BPF 2023 IOMAP conversion status update
@ 2023-01-29  4:46 Luis Chamberlain
  2023-01-29  5:06 ` Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Luis Chamberlain @ 2023-01-29  4:46 UTC (permalink / raw)
  To: lsf-pc, Christoph Hellwig, Matthew Wilcox, David Howells
  Cc: Luis Chamberlain, kbus >> Keith Busch, Pankaj Raghav,
	linux-fsdevel, linux-mm

One of the recurring themes that comes up at LSF is "iomap has little
to no documentation, it is hard to use". I've only recently taken a
little nose dive into it, so I can frankly admit I don't grok it well
yet either. However, the *general* motivation and value are clear:
avoiding the old ugly monster of struct buffer_head, and abstracting
the page cache for non-network filesystems -- for network filesystems
my understanding is that there is a separate side effort for that. We
could go a bit down memory lane on prior attempts to kill the struct
buffer_head demon in Linux, or on why it's evil, but I'm not sure
recapping that is useful at this point in time; let me know, I could do
that if it helps folks who want to talk about this at LSF. For now I'd
rather focus on sharing efforts to review where we are today on the
conversion towards iomap for some of the major filesystems:

https://docs.google.com/presentation/d/e/2PACX-1vSN4TmhiTu1c6HNv6_gJZFqbFZpbF7GkABllSwJw5iLnSYKkkO-etQJ3AySYEbgJA/pub?start=true&loop=false&delayms=3000&slide=id.g189cfd05063_0_225

I'm hoping this *might* be useful to some, but I fear it may leave
quite a few folks with more questions than answers, as it did for me.
Hence I figured that *this aspect of this topic* perhaps might be a
good topic for LSF.  The end goal would hopefully be to finally enable
us to document the iomap API properly and to help with the whole
conversion effort.

My takeaway from this quick review of API evolution and use is that
XFS is *certainly* a first-class citizen user. No surprise there, given
a lot of the effort came out of XFS. And even though btrfs now avoids
the evil struct buffer_head monster, its use of the iomap API seems
*dramatically* different from XFS's, and it probably puzzles many. Is it
that btrfs managed to get rid of struct buffer_head use but stopped
short of fully abstracting its work with the page cache? How does one
check? What semantics do we look for?

When looking to see whether one can help on the conversion front with
other filesystems, it raises the question of what the real end goal is.
What should one strive for? It also gets me wondering: if we wanted to
abstract the page cache from scratch again, would we do it a bit
differently now? Are there lessons from the network filesystem side of
things which can be shared? If so, it gets me wondering whether this
should instead be about why that's a good idea and what it should look
like.

Perhaps fs/buffer.c could be converted to folios only, and be done
with it. But would we be losing out on something? What would that be?

  Luis



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-01-29  4:46 LSF/MM/BPF 2023 IOMAP conversion status update Luis Chamberlain
@ 2023-01-29  5:06 ` Matthew Wilcox
  2023-01-29  5:39   ` Luis Chamberlain
                     ` (2 more replies)
  2023-02-27 19:06 ` Darrick J. Wong
  2023-03-01 16:59 ` Ritesh Harjani
  2 siblings, 3 replies; 23+ messages in thread
From: Matthew Wilcox @ 2023-01-29  5:06 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: lsf-pc, Christoph Hellwig, David Howells,
	kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel,
	linux-mm

On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
> I'm hoping this *might* be useful to some, but I fear it may leave quite
> a bit of folks with more questions than answers as it did for me. And
> hence I figured that *this aspect of this topic* perhaps might be a good
> topic for LSF.  The end goal would hopefully then be finally enabling us
> to document IOMAP API properly and helping with the whole conversion
> effort.

+1 from me.

I've made a couple of abortive efforts to try and convert a "trivial"
filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on
what the semantics are for get_block_t and iomap_begin().
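
(For reference, the two interfaces being juxtaposed -- signatures copied
roughly from the headers and from later in this thread, so treat the
exact spelling as approximate:)

	typedef int (get_block_t)(struct inode *inode, sector_t iblock,
			struct buffer_head *bh_result, int create);

	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
			unsigned flags, struct iomap *iomap,
			struct iomap *srcmap);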

> Perhaps fs/buffers.c could be converted to folios only, and be done
> with it. But would we be loosing out on something? What would that be?

buffer_heads are inefficient for multi-page folios because some of the
algorithms are O(n^2) for n being the number of buffers in a folio.
It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in
a 2MB folio, it's pretty sticky.  Things like "Read I/O has completed on
this buffer, can I mark the folio as Uptodate now?"  For iomap, that's a
scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512
allocations, looking at one bit in each BH before moving on to the next.
Similarly for writeback, iirc.
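
To make that concrete, a simplified pseudo-C sketch of the two "is the
folio uptodate?" checks -- folio_uptodate_iomap()/folio_uptodate_bh() are
made-up names and the real code has more going on:

	/* iomap: one bitmap per folio, a single scan answers the question */
	static bool folio_uptodate_iomap(const unsigned long *uptodate,
					 unsigned int blocks_per_folio)
	{
		return bitmap_full(uptodate, blocks_per_folio);
	}

	/* BHs: walk the circular list of separately-allocated buffer_heads
	 * attached to the folio, testing one bit in each */
	static bool folio_uptodate_bh(struct buffer_head *head)
	{
		struct buffer_head *bh = head;

		do {
			if (!buffer_uptodate(bh))
				return false;
		} while ((bh = bh->b_this_page) != head);
		return true;
	}

and the BH walk has to be repeated on every completion, which is where
the quadratic behaviour comes from.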

So +1 from me for a "How do we convert 35-ish block based filesystems
from BHs to iomap for their buffered & direct IO paths".  There's maybe a
separate discussion to be had for "What should the API be for filesystems
to access metadata on the block device" because I don't believe the
page-cache based APIs are easy for fs authors to use.

Maybe some related topics are
"What testing should we require for some of these ancient filesystems?"
"Whose job is it to convert these 35 filesystems anyway, can we just
delete some of them?"
"Is there a lower-performance but easier-to-implement API than iomap
for old filesystems that only exist for compatibility reasons?"




* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-01-29  5:06 ` Matthew Wilcox
@ 2023-01-29  5:39   ` Luis Chamberlain
  2023-02-08 16:04   ` Jan Kara
  2023-02-27 19:47   ` Darrick J. Wong
  2 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2023-01-29  5:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, Christoph Hellwig, David Howells,
	kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel,
	linux-mm

On Sun, Jan 29, 2023 at 05:06:47AM +0000, Matthew Wilcox wrote:
> There's maybe a
> separate discussion to be had for "What should the API be for filesystems
> to access metadata on the block device" because I don't believe the
> page-cache based APIs are easy for fs authors to use.

OK sure, and *why* that would be good, yes sure. Perhaps that should be
discussed first, though, as I think it may then be easier to appreciate
iomap.

> Maybe some related topics are
> "What testing should we require for some of these ancient filesystems?"

Oh yes....

> "Whose job is it to convert these 35 filesystems anyway, can we just
> delete some of them?"

> "Is there a lower-performance but easier-to-implement API than iomap
> for old filesystems that only exist for compatibiity reasons?"

Yes...

  Luis



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-01-29  5:06 ` Matthew Wilcox
  2023-01-29  5:39   ` Luis Chamberlain
@ 2023-02-08 16:04   ` Jan Kara
  2023-02-24  7:01     ` Zhang Yi
                       ` (2 more replies)
  2023-02-27 19:47   ` Darrick J. Wong
  2 siblings, 3 replies; 23+ messages in thread
From: Jan Kara @ 2023-02-08 16:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, lsf-pc, Christoph Hellwig, David Howells,
	kbus @imap.suse.de>> Keith Busch, Pankaj Raghav,
	linux-fsdevel, linux-mm

On Sun 29-01-23 05:06:47, Matthew Wilcox wrote:
> On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
> > I'm hoping this *might* be useful to some, but I fear it may leave quite
> > a bit of folks with more questions than answers as it did for me. And
> > hence I figured that *this aspect of this topic* perhaps might be a good
> > topic for LSF.  The end goal would hopefully then be finally enabling us
> > to document IOMAP API properly and helping with the whole conversion
> > effort.
> 
> +1 from me.
> 
> I've made a couple of abortive efforts to try and convert a "trivial"
> filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on
> what the semantics are for get_block_t and iomap_begin().

Yeah, I'd also be interested in this discussion, in particular as a
maintainer of some of these legacy filesystems (ext2, udf, isofs).

> > Perhaps fs/buffers.c could be converted to folios only, and be done
> > with it. But would we be loosing out on something? What would that be?
> 
> buffer_heads are inefficient for multi-page folios because some of the
> algorthims are O(n^2) for n being the number of buffers in a folio.
> It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in
> a 2MB folio, it's pretty sticky.  Things like "Read I/O has completed on
> this buffer, can I mark the folio as Uptodate now?"  For iomap, that's a
> scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512
> allocations, looking at one bit in each BH before moving on to the next.
> Similarly for writeback, iirc.
> 
> So +1 from me for a "How do we convert 35-ish block based filesystems
> from BHs to iomap for their buffered & direct IO paths".  There's maybe a
> separate discussion to be had for "What should the API be for filesystems
> to access metadata on the block device" because I don't believe the
> page-cache based APIs are easy for fs authors to use.

Yeah, so the actual data paths should be relatively easy for these old
filesystems as they usually don't do anything special (those that do - like
reiserfs - are deprecated and to be removed). But for metadata we do need
some convenience functions like: give me the block of metadata at this
block number, mark it dirty / clean / uptodate (block-granularity dirtying
and uptodate state is an absolute must for metadata, otherwise we'll have
data corruption issues). From the more complex functionality we need stuff
like: lock a particular block of metadata (the equivalent of buffer lock),
track that this block is metadata for a given inode so that it can be
written on fsync(2). Then fancier filesystems like ext4 also need to attach
more private state to each metadata block, but that needs to be dealt with
on a case-by-case basis anyway.
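
To sketch the shape of what I mean (all of these names are hypothetical,
nothing like this exists today):

	struct fsbuf;	/* a block of metadata in some fs-private cache */

	struct fsbuf *fsbuf_read(struct block_device *bdev, sector_t block,
				 unsigned int size);	/* read + uptodate */
	void fsbuf_mark_dirty(struct fsbuf *fb);  /* block-granular dirtying */
	void fsbuf_lock(struct fsbuf *fb);	  /* like lock_buffer() */
	void fsbuf_unlock(struct fsbuf *fb);
	void fsbuf_attach_inode(struct fsbuf *fb, struct inode *inode);
						  /* so fsync(2) can find it */
	void fsbuf_put(struct fsbuf *fb);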

> Maybe some related topics are
> "What testing should we require for some of these ancient filesystems?"
> "Whose job is it to convert these 35 filesystems anyway, can we just
> delete some of them?"

I certainly would not miss some of these filesystems - like minix, sysv, ...
But before really threatening to remove some of these ancient and long
untouched filesystems, we should convert at least those we do care about.
Once there's a precedent for what a simple filesystem conversion looks
like, it will be easier to argue about what to do with the ones we don't
care about so much.

> "Is there a lower-performance but easier-to-implement API than iomap
> for old filesystems that only exist for compatibiity reasons?"

As I wrote above, for metadata there ought to be something, as otherwise it
will be real pain (and no gain really). But I guess the concrete API only
materializes once we attempt a conversion of some filesystem like ext2.
I'll try to have a look into that, at least the obvious preparatory steps
like converting the data paths to iomap.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-02-08 16:04   ` Jan Kara
@ 2023-02-24  7:01     ` Zhang Yi
  2023-02-26 20:16     ` Ritesh Harjani
  2023-02-27 19:26     ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong
  2 siblings, 0 replies; 23+ messages in thread
From: Zhang Yi @ 2023-02-24  7:01 UTC (permalink / raw)
  To: Luis Chamberlain, Jan Kara, Matthew Wilcox
  Cc: lsf-pc, Christoph Hellwig, David Howells,
	kbus @imap.suse.de>> Keith Busch, Pankaj Raghav,
	linux-fsdevel, linux-mm, yi.zhang, guohanjun

On 2023/2/9 0:04, Jan Kara wrote:
> On Sun 29-01-23 05:06:47, Matthew Wilcox wrote:
>> On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
>>> I'm hoping this *might* be useful to some, but I fear it may leave quite
>>> a bit of folks with more questions than answers as it did for me. And
>>> hence I figured that *this aspect of this topic* perhaps might be a good
>>> topic for LSF.  The end goal would hopefully then be finally enabling us
>>> to document IOMAP API properly and helping with the whole conversion
>>> effort.
>>
>> +1 from me.
>>
>> I've made a couple of abortive efforts to try and convert a "trivial"
>> filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on
>> what the semantics are for get_block_t and iomap_begin().
> 
> Yeah, I'd be also interested in this discussion. In particular as a
> maintainer of part of these legacy filesystems (ext2, udf, isofs).
> 
>>> Perhaps fs/buffers.c could be converted to folios only, and be done
>>> with it. But would we be loosing out on something? What would that be?
>>
>> buffer_heads are inefficient for multi-page folios because some of the
>> algorthims are O(n^2) for n being the number of buffers in a folio.
>> It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in
>> a 2MB folio, it's pretty sticky.  Things like "Read I/O has completed on
>> this buffer, can I mark the folio as Uptodate now?"  For iomap, that's a
>> scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512
>> allocations, looking at one bit in each BH before moving on to the next.
>> Similarly for writeback, iirc.
>>
>> So +1 from me for a "How do we convert 35-ish block based filesystems
>> from BHs to iomap for their buffered & direct IO paths".  There's maybe a
>> separate discussion to be had for "What should the API be for filesystems
>> to access metadata on the block device" because I don't believe the
>> page-cache based APIs are easy for fs authors to use.
> 
> Yeah, so the actual data paths should be relatively easy for these old
> filesystems as they usually don't do anything special (those that do - like
> reiserfs - are deprecated and to be removed). But for metadata we do need
> some convenience functions like - give me block of metadata at this block
> number, make it dirty / clean / uptodate (block granularity dirtying &
> uptodate state is absolute must for metadata, otherwise we'll have data
> corruption issues). From the more complex functionality we need stuff like:
> lock particular block of metadata (equivalent of buffer lock), track that
> this block is metadata for given inode so that it can be written on
> fsync(2). Then more fancy filesystems like ext4 also need to attach more
> private state to each metadata block but that needs to be dealt with on
> case-by-case basis anyway.
> 

Hello, all.

I am also interested in this topic, especially the iomap conversion of the
ext4 buffered I/O paths, and also in the discussion of the metadata APIs:
the current buffer_heads can lead to many potential problems and bring a
lot of quality challenges to our products. I look forward to more
discussion if I can attend offline.

Thanks,
Yi.

>> Maybe some related topics are
>> "What testing should we require for some of these ancient filesystems?"
>> "Whose job is it to convert these 35 filesystems anyway, can we just
>> delete some of them?"
> 
> I would not certainly miss some more filesystems - like minix, sysv, ...
> But before really treatening to remove some of these ancient and long
> untouched filesystems, we should convert at least those we do care about.
> When there's precedent how simple filesystem conversion looks like, it is
> easier to argue about what to do with the ones we don't care about so much.
> 
>> "Is there a lower-performance but easier-to-implement API than iomap
>> for old filesystems that only exist for compatibiity reasons?"
> 
> As I wrote above, for metadata there ought to be something as otherwise it
> will be real pain (and no gain really). But I guess the concrete API only
> matterializes once we attempt a conversion of some filesystem like ext2.
> I'll try to have a look into that, at least the obvious preparatory steps
> like converting the data paths to iomap.
> 
> 								Honza
> 




* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-02-08 16:04   ` Jan Kara
  2023-02-24  7:01     ` Zhang Yi
@ 2023-02-26 20:16     ` Ritesh Harjani
  2023-03-16 14:40       ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM)
  2023-02-27 19:26     ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong
  2 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani @ 2023-02-26 20:16 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox
  Cc: Luis Chamberlain, lsf-pc, Christoph Hellwig, David Howells,
	kbus @imap.suse.de>> Keith Busch, Pankaj Raghav,
	linux-fsdevel, linux-mm

Jan Kara <jack@suse.cz> writes:

> On Sun 29-01-23 05:06:47, Matthew Wilcox wrote:
>> On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
>> > I'm hoping this *might* be useful to some, but I fear it may leave quite
>> > a bit of folks with more questions than answers as it did for me. And
>> > hence I figured that *this aspect of this topic* perhaps might be a good
>> > topic for LSF.  The end goal would hopefully then be finally enabling us
>> > to document IOMAP API properly and helping with the whole conversion
>> > effort.
>>
>> +1 from me.

+1 from my end as well please.

Currently I have also been working on adding sub-page size dirty tracking
support to the iomap layer so that we don't have the write amplification
problem for buffered writes on bs < ps systems [1].
This also improves performance, including in some real-world use cases
like a postgres DB workload.

There are also some further points that we would like to discuss
on how to optimize/improve iomap dirty bitmap tracking for large folios.
I can try to come up with some ideas in that regard so that we can
discuss them with others as well [2].

[1]: https://lore.kernel.org/all/cover.1677428794.git.ritesh.list@gmail.com/
[2]: https://lore.kernel.org/all/20230130210113.opdvyliooizicrsk@rh-tp/
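
Conceptually, the per-folio state with that change looks roughly like the
below (sketch only, struct and field names are approximate):

	struct iomap_folio_state {
		spinlock_t	state_lock;
		atomic_t	read_bytes_pending;
		/* bits [0, nr_blocks) track per-block uptodate state,
		 * bits [nr_blocks, 2 * nr_blocks) track per-block dirty state */
		unsigned long	state[];
	};

so that writeback only touches the blocks that were actually dirtied
instead of the whole folio.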

>>
>> I've made a couple of abortive efforts to try and convert a "trivial"
>> filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on
>> what the semantics are for get_block_t and iomap_begin().
>
> Yeah, I'd be also interested in this discussion. In particular as a
> maintainer of part of these legacy filesystems (ext2, udf, isofs).
>
>> > Perhaps fs/buffers.c could be converted to folios only, and be done
>> > with it. But would we be loosing out on something? What would that be?
>>
>> buffer_heads are inefficient for multi-page folios because some of the
>> algorthims are O(n^2) for n being the number of buffers in a folio.
>> It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in
>> a 2MB folio, it's pretty sticky.  Things like "Read I/O has completed on
>> this buffer, can I mark the folio as Uptodate now?"  For iomap, that's a
>> scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512
>> allocations, looking at one bit in each BH before moving on to the next.
>> Similarly for writeback, iirc.
>>
>> So +1 from me for a "How do we convert 35-ish block based filesystems
>> from BHs to iomap for their buffered & direct IO paths".  There's maybe a
>> separate discussion to be had for "What should the API be for filesystems
>> to access metadata on the block device" because I don't believe the
>> page-cache based APIs are easy for fs authors to use.
>
> Yeah, so the actual data paths should be relatively easy for these old
> filesystems as they usually don't do anything special (those that do - like
> reiserfs - are deprecated and to be removed). But for metadata we do need
> some convenience functions like - give me block of metadata at this block
> number, make it dirty / clean / uptodate (block granularity dirtying &
> uptodate state is absolute must for metadata, otherwise we'll have data
> corruption issues). From the more complex functionality we need stuff like:
> lock particular block of metadata (equivalent of buffer lock), track that
> this block is metadata for given inode so that it can be written on
> fsync(2). Then more fancy filesystems like ext4 also need to attach more
> private state to each metadata block but that needs to be dealt with on
> case-by-case basis anyway.
>
>> Maybe some related topics are
>> "What testing should we require for some of these ancient filesystems?"
>> "Whose job is it to convert these 35 filesystems anyway, can we just
>> delete some of them?"
>
> I would not certainly miss some more filesystems - like minix, sysv, ...
> But before really treatening to remove some of these ancient and long
> untouched filesystems, we should convert at least those we do care about.
> When there's precedent how simple filesystem conversion looks like, it is
> easier to argue about what to do with the ones we don't care about so much.
>
>> "Is there a lower-performance but easier-to-implement API than iomap
>> for old filesystems that only exist for compatibiity reasons?"
>
> As I wrote above, for metadata there ought to be something as otherwise it
> will be real pain (and no gain really). But I guess the concrete API only
> matterializes once we attempt a conversion of some filesystem like ext2.
> I'll try to have a look into that, at least the obvious preparatory steps
> like converting the data paths to iomap.

I have worked in the past with Jan and others on adding iomap support for
DIO in ext4. I have also moved the fiemap, bmap and swap operations to
iomap for ext4, and I would like to continue working in this direction
towards making ext4 completely switch to iomap (including the buffered
I/O path).

I would definitely like to convert ext2 to iomap (working along with you on
this, of course :)). This, in my opinion, can also help us figure
out whether iomap requires any changes to support the ext* family of
filesystems, especially for the buffered I/O path. This IMO is simpler
than starting with the ext4 buffered I/O path, where fscrypt, fsverity
etc. get in our way...

Let me also take a look at this and get back to you. At first I will get
started with the ext2 DIO path (which I think should be very
straightforward, since we already have iomap_ops for DAX in ext2). While
at it, I will also see what is required for the buffered I/O path
conversion.

[3]: https://lore.kernel.org/all/?q=s%3Aiomap+and+f%3Aritesh

-ritesh



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-01-29  4:46 LSF/MM/BPF 2023 IOMAP conversion status update Luis Chamberlain
  2023-01-29  5:06 ` Matthew Wilcox
@ 2023-02-27 19:06 ` Darrick J. Wong
  2023-02-27 19:58   ` Luis Chamberlain
  2023-03-01 16:59 ` Ritesh Harjani
  2 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-02-27 19:06 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: lsf-pc, Christoph Hellwig, Matthew Wilcox, David Howells,
	kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel,
	linux-mm

On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
> One of the recurring themes that comes up at LSF is "iomap has little
> to no documentation, it is hard to use". I've only recently taken a
> little nose dive into it, and so I also can frankly admit to say I don't
> grok it well either yet. However, the *general* motivation and value is clear:
> avoiding the old ugly monster of struct buffer_head, and abstracting
> the page cache for non network filesystems, and that is because for
> network filesystems my understanding is that we have another side effort
> for that. We could go a bit down memory lane on prior attempts to kill
> the struct buffer_head evil demon from Linux, or why its evil, but I'm not
> sure if recapping that is useful at this point in time, let me know, I could
> do that if it helps if folks want to talk about this at LSF.

I think there's so much to talk about WRT iomap that I also think we're
not going to have time for recapping why buffer heads are awful.

> For now I rather
> instead focus on sharing efforts to review where we are today on the effort
> towards conversion towards IOMAP for some of the major filesystems:
> 
> https://docs.google.com/presentation/d/e/2PACX-1vSN4TmhiTu1c6HNv6_gJZFqbFZpbF7GkABllSwJw5iLnSYKkkO-etQJ3AySYEbgJA/pub?start=true&loop=false&delayms=3000&slide=id.g189cfd05063_0_225

Ooh, slides, excellent!

Slide 3 -- there are a lot of separate struct iomap_ops in XFS because
we learned (somewhat late) that creating a separate ops structure (and
functions) for each IO path / use case is much more sanity-preserving
than multiplexing everything into one gigantic function.

Now we have four different write codepaths in the kernel:

extern const struct iomap_ops xfs_buffered_write_iomap_ops;
extern const struct iomap_ops xfs_page_mkwrite_iomap_ops;

These two use delayed allocation and use the pagecache for COW.

extern const struct iomap_ops xfs_direct_write_iomap_ops;
extern const struct iomap_ops xfs_dax_write_iomap_ops;

These two don't use delayed allocation.  IIRC the DAX ops also take care
of pre-zeroing allocations since we assume that memcpy to pmem is
relatively "cheap".

Only one is needed to cover the read cases and FIEMAP:

extern const struct iomap_ops xfs_read_iomap_ops;

And then there's the weird ones for SEEK_{DATA,HOLE}:

extern const struct iomap_ops xfs_seek_iomap_ops;

And FIEMAP for xattrs:

extern const struct iomap_ops xfs_xattr_iomap_ops;

(I'll get to xfs_writeback_ops downthread.)

> I'm hoping this *might* be useful to some, but I fear it may leave quite
> a bit of folks with more questions than answers as it did for me. And
> hence I figured that *this aspect of this topic* perhaps might be a good
> topic for LSF.  The end goal would hopefully then be finally enabling us
> to document IOMAP API properly and helping with the whole conversion
> effort.

Heh.  Yes.  Absolutely yes.

struct iomap_ops is a giant ball of frustration:

	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
			unsigned flags, struct iomap *iomap,
			struct iomap *srcmap);

If you're a filesystem trying to use iomap, it's not at all clear when
you're supposed to fill out @iomap and when you need to fill out
@srcmap.  Right now, the rule is that an ->iomap_begin implementation
should always fill out @iomap.  If the operation is a copy-on-write,
only then does the ->iomap_begin function need to fill out @srcmap.
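
As a minimal sketch of what that means for a simple filesystem (the
simplefs_* names and the block-lookup helper are made up, and error
handling is elided), an overwrite-in-place filesystem only ever touches
@iomap:

	static int simplefs_iomap_begin(struct inode *inode, loff_t pos,
			loff_t length, unsigned flags, struct iomap *iomap,
			struct iomap *srcmap)
	{
		/* hypothetical helper: map @pos to an on-disk block, 0 = hole */
		sector_t block = simplefs_get_mapping(inode, pos);

		iomap->bdev = inode->i_sb->s_bdev;
		iomap->offset = round_down(pos, i_blocksize(inode));
		iomap->length = i_blocksize(inode);
		if (block) {
			iomap->type = IOMAP_MAPPED;
			iomap->addr = (u64)block << inode->i_blkbits;
		} else {
			iomap->type = IOMAP_HOLE;
			iomap->addr = IOMAP_NULL_ADDR;
		}
		/* @srcmap stays untouched: no copy-on-write here */
		return 0;
	}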

I've been mulling over redesigning this so that the two parameters are
@read_iomap and @write_iomap.  Any time there's a place from which to
read data, the ->iomap_begin function would fill out @read_iomap.  This
would also be the vessel for SEEK_*, FIEMAP, and swapfile activation.

Any time there's a place in which to write data, the ->iomap_begin
function would fill out @write_iomap.  For a COW, the read and write
mappings would be different.  For a pure overwrite, they'd be the same.

This is going to take a bit of thought to get right.  It'll be easier
for XFS because we split the iomap_ops to handle different
responsibilities, but I haven't looked all that far into ext4 and btrfs
to see what they do.

I also want to convert the inode/pos/length/flags arguments to a const
pointer to struct iomap_iter to reduce the register count.
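
Putting those two ideas together, the prototype might end up looking
something like this (purely illustrative, not a settled proposal):

	int (*iomap_begin)(const struct iomap_iter *iter,
			   struct iomap *read_iomap,
			   struct iomap *write_iomap);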

> My gatherings from this quick review of API evolution and use is that,
> XFS is *certainly* a first class citizen user. No surprise there if a
> lot of the effort came out from XFS. And even though btrfs now avoids
> the evil struct buffer_head monster, its use of the IOMAP API seems
> *dramatically* different than XFS, and it probably puzzles many. Is it
> that btrfs managed to just get rid of struct buffer_head use but missed
> fully abstracting working with the page cache? How does one check? What
> semantics do we look for?

I'm under the impression that for buffered io, btrfs manages pagecache
folios directly, without the use of iomap at all.  Hence no mention of
buffer heads in the codebase.

A big problem I see in porting btrfs/ext*/f2fs/etc to use the iomap
pagecache code is that iomap doesn't support io permutations at all.
It assumes that it can assemble struct bio with pagecache pages and
issue that bio directly to the device.

IOWs, someone will need to port fscrypt, fsverity, compression, etc. to
iomap before those filesystems can jump off bufferheads.

> When looking to see if one can help on the conversion front with other
> filesystems it begs the question what is the correct real end goal. What
> should one strive for? And it also gets me wondering, if we wanted to abstract
> the page cache from scratch again, would we have done this a bit differently
> now? Are there lessons from the network filesystem side of things which
> can be shared? If so it gets me wondering if this instead should be
> about why that's a good idea and what should that look like.

All good questions to ask.

> Perhaps fs/buffers.c could be converted to folios only, and be done
> with it. But would we be loosing out on something? What would that be?

Dirty secret here: I still don't understand what buffers.c does. ;)

--D

>   Luis



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-02-08 16:04   ` Jan Kara
  2023-02-24  7:01     ` Zhang Yi
  2023-02-26 20:16     ` Ritesh Harjani
@ 2023-02-27 19:26     ` Darrick J. Wong
  2023-02-27 21:02       ` Matthew Wilcox
  2 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-02-27 19:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, Christoph Hellwig,
	David Howells, kbus @imap.suse.de>> Keith Busch,
	Pankaj Raghav, linux-fsdevel, linux-mm

On Wed, Feb 08, 2023 at 05:04:22PM +0100, Jan Kara wrote:
> On Sun 29-01-23 05:06:47, Matthew Wilcox wrote:
> > On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
> > > I'm hoping this *might* be useful to some, but I fear it may leave quite
> > > a bit of folks with more questions than answers as it did for me. And
> > > hence I figured that *this aspect of this topic* perhaps might be a good
> > > topic for LSF.  The end goal would hopefully then be finally enabling us
> > > to document IOMAP API properly and helping with the whole conversion
> > > effort.
> > 
> > +1 from me.
> > 
> > I've made a couple of abortive efforts to try and convert a "trivial"
> > filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on
> > what the semantics are for get_block_t and iomap_begin().
> 
> Yeah, I'd be also interested in this discussion. In particular as a
> maintainer of part of these legacy filesystems (ext2, udf, isofs).
> 
> > > Perhaps fs/buffers.c could be converted to folios only, and be done
> > > with it. But would we be loosing out on something? What would that be?
> > 
> > buffer_heads are inefficient for multi-page folios because some of the
> > algorthims are O(n^2) for n being the number of buffers in a folio.
> > It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in
> > a 2MB folio, it's pretty sticky.  Things like "Read I/O has completed on
> > this buffer, can I mark the folio as Uptodate now?"  For iomap, that's a
> > scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512
> > allocations, looking at one bit in each BH before moving on to the next.
> > Similarly for writeback, iirc.
> > 
> > So +1 from me for a "How do we convert 35-ish block based filesystems
> > from BHs to iomap for their buffered & direct IO paths".  There's maybe a
> > separate discussion to be had for "What should the API be for filesystems
> > to access metadata on the block device" because I don't believe the
> > page-cache based APIs are easy for fs authors to use.
> 
> Yeah, so the actual data paths should be relatively easy for these old
> filesystems as they usually don't do anything special (those that do - like
> reiserfs - are deprecated and to be removed). But for metadata we do need
> some convenience functions like - give me block of metadata at this block
> number, make it dirty / clean / uptodate (block granularity dirtying &
> uptodate state is absolute must for metadata, otherwise we'll have data
> corruption issues). From the more complex functionality we need stuff like:
> lock particular block of metadata (equivalent of buffer lock), track that
> this block is metadata for given inode so that it can be written on
> fsync(2). Then more fancy filesystems like ext4 also need to attach more
> private state to each metadata block but that needs to be dealt with on
> case-by-case basis anyway.

I reiterate a years-ago suggestion from Dave Chinner to reintroduce a
(metadata-only) buffer cache for filesystems.

xfs_buf already does everything you want -- read a buffer, mark it
dirty, bwrite it to storage, lock it, unlock it, mark it stale, etc.
Memory pages are tracked (and reclaimed) separately from the bdev page
cache, which means filesystem authors don't need to figure out the
pagecache APIs.
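
The usage pattern today looks roughly like this (written from memory, so
treat the exact signatures as approximate):

	struct xfs_buf	*bp;
	int		error;

	/* read (or find cached) metadata; returned locked and uptodate */
	error = xfs_buf_read(mp->m_ddev_targp, blkno, numblks, 0, &bp,
			     &xfs_sb_buf_ops);
	if (error)
		return error;

	/* ... modify the metadata via bp->b_addr ... */

	error = xfs_bwrite(bp);		/* or log it in a transaction */
	xfs_buf_relse(bp);		/* unlock and drop the reference */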

Upside/downside: Userspace programs scribbling on a mounted block device
lose some ability to screw over that mounted filesystem.  tune2fs loses
the ability to change ext4 labels.

Things get harder when you want to feed buffers to jbd2, since IIRC jbd2
tracks its own state in the journal head.  Some of the pieces are similar
to the xfs_buf_log_item (shadow buffers, list of items in transaction),
but others (triggers) aren't.  Given that jbd2 and ext4 both want to
attach other bits of information, I suspect you'd have to let the
filesystem allocate the struct fsbuf objects.

OTOH, once you do this, I think it shouldn't be hard to remove buffer
heads from ext* except for the crypt/verity files.

> > Maybe some related topics are
> > "What testing should we require for some of these ancient filesystems?"
> > "Whose job is it to convert these 35 filesystems anyway, can we just
> > delete some of them?"
> 
> I would not certainly miss some more filesystems - like minix, sysv, ...
> But before really treatening to remove some of these ancient and long
> untouched filesystems, we should convert at least those we do care about.
> When there's precedent how simple filesystem conversion looks like, it is
> easier to argue about what to do with the ones we don't care about so much.

AFAICT, most of those old filesystems are pretty simple -- they don't do
the fancy cow and transformation things that modern filesystems do.  My
guess is that rewiring the buffered IO path wouldn't be that hard.
Dealing with all the other bitrot (metadata buffer heads, lack of
testing) is what's going to kill them or force them into becoming fuse
servers that we can isolate to userspace.

Maybe that isn't such a bad thing.

> > "Is there a lower-performance but easier-to-implement API than iomap
> > for old filesystems that only exist for compatibiity reasons?"
> 
> As I wrote above, for metadata there ought to be something as otherwise it
> will be real pain (and no gain really). But I guess the concrete API only
> matterializes once we attempt a conversion of some filesystem like ext2.
> I'll try to have a look into that, at least the obvious preparatory steps
> like converting the data paths to iomap.

willy's tried this.

--D

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-01-29  5:06 ` Matthew Wilcox
  2023-01-29  5:39   ` Luis Chamberlain
  2023-02-08 16:04   ` Jan Kara
@ 2023-02-27 19:47   ` Darrick J. Wong
  2023-02-27 20:24     ` Luis Chamberlain
  2 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-02-27 19:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, lsf-pc, Christoph Hellwig, David Howells,
	kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel,
	linux-mm

On Sun, Jan 29, 2023 at 05:06:47AM +0000, Matthew Wilcox wrote:
> On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
> > I'm hoping this *might* be useful to some, but I fear it may leave quite
> > a bit of folks with more questions than answers as it did for me. And
> > hence I figured that *this aspect of this topic* perhaps might be a good
> > topic for LSF.  The end goal would hopefully then be finally enabling us
> > to document IOMAP API properly and helping with the whole conversion
> > effort.
> 
> +1 from me.
> 
> I've made a couple of abortive efforts to try and convert a "trivial"
> filesystem like ext2/ufs/sysv/jfs to iomap, and I always get hung up on
> what the semantics are for get_block_t and iomap_begin().

Yup.  I wrote about it a little bit here:

https://lore.kernel.org/linux-fsdevel/Y%2Fz%2FJrV8qRhUcqE7@magnolia/T/#mda6c3175857d1e4cba88dca042fee030207df4f6

...and promised that I'd get back to writeback.

For buffered IO, iomap does things in a much different order than (I
think) most filesystems.  Traditionally, I think the order is i_rwsem ->
mmap_invalidatelock(?) -> page lock -> get mapping.

iomap relies on the callers to take i_rwsem, asks the filesystem for a
mapping (with whatever locking that entails), and only then starts
locking pagecache folios to operate on them.  IOWs, involving the
filesystem earlier in the process enables it to make better decisions
about space allocations, which in turn should make things faster and
less fragmenty.
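
In other words, roughly:

	traditional:  i_rwsem -> invalidate lock -> folio lock -> ->get_block()
	iomap:        i_rwsem -> ->iomap_begin() (fs takes its own locks) -> folio lock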

OTOH, it also means that we've learned the hard way that pagecache
operations need a means to revalidate mappings to avoid write races.
This applies both to the initial pagecache write and to scheduling
writeback, but the mechanisms for each were developed separately and
years apart.  See iomap::validity_cookie and
xfs_writepage_ctx::{data,cow}_seq for what I'm talking about.
We (xfs developers) ought to figure out if these two mechanisms should
be merged before more filesystems start using iomap for buffered io.
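
The revalidation dance, in very rough pseudo-C (all the helper names here
are hypothetical and the real complexity is elided):

	retry:
		/* fs takes its own locks, samples its sequence counter and
		 * stashes it in the mapping as a validity cookie */
		cookie = fs_map_blocks(inode, pos, len, &iomap);

		/* grab and lock the pagecache folio covering @pos */
		folio = grab_and_lock_folio(inode->i_mapping, pos);

		if (!fs_mapping_still_valid(inode, cookie)) {
			/* the mapping changed underneath us (truncate, hole
			 * punch, COW fork update, ...): remap and retry */
			folio_unlock(folio);
			folio_put(folio);
			goto retry;
		}

		/* safe to copy data in / schedule writeback against @iomap */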

I'd like to have a discussion about how to clean up and clarify the
iomap interfaces, and a separate one about how to port the remaining 35+
filesystems.  I don't know how exactly to split this into LSF sessions,
other than to suggest at least two.

If hch or dchinner show up, I also want to drag them into this. :)

--D

> > Perhaps fs/buffers.c could be converted to folios only, and be done
> > with it. But would we be loosing out on something? What would that be?
> 
> buffer_heads are inefficient for multi-page folios because some of the
> algorthims are O(n^2) for n being the number of buffers in a folio.
> It's fine for 8x 512b buffers in a 4k page, but for 512x 4kb buffers in
> a 2MB folio, it's pretty sticky.  Things like "Read I/O has completed on
> this buffer, can I mark the folio as Uptodate now?"  For iomap, that's a
> scan of a 64 byte bitmap up to 512 times; for BHs, it's a loop over 512
> allocations, looking at one bit in each BH before moving on to the next.
> Similarly for writeback, iirc.
> 
> So +1 from me for a "How do we convert 35-ish block based filesystems
> from BHs to iomap for their buffered & direct IO paths".  There's maybe a
> separate discussion to be had for "What should the API be for filesystems
> to access metadata on the block device" because I don't believe the
> page-cache based APIs are easy for fs authors to use.
> 
> Maybe some related topics are
> "What testing should we require for some of these ancient filesystems?"
> "Whose job is it to convert these 35 filesystems anyway, can we just
> delete some of them?"
> "Is there a lower-performance but easier-to-implement API than iomap
> for old filesystems that only exist for compatibiity reasons?"
> 



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-02-27 19:06 ` Darrick J. Wong
@ 2023-02-27 19:58   ` Luis Chamberlain
  0 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2023-02-27 19:58 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: lsf-pc, Christoph Hellwig, Matthew Wilcox, David Howells,
	kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel,
	linux-mm

On Mon, Feb 27, 2023 at 11:06:14AM -0800, Darrick J. Wong wrote:
> On Sat, Jan 28, 2023 at 08:46:45PM -0800, Luis Chamberlain wrote:
> > One of the recurring themes that comes up at LSF is "iomap has little
> > to no documentation, it is hard to use". I've only recently taken a
> > little nose dive into it, and so I also can frankly admit to say I don't
> > grok it well either yet. However, the *general* motivation and value is clear:
> > avoiding the old ugly monster of struct buffer_head, and abstracting
> > the page cache for non network filesystems, and that is because for
> > network filesystems my understanding is that we have another side effort
> > for that. We could go a bit down memory lane on prior attempts to kill
> > the struct buffer_head evil demon from Linux, or why its evil, but I'm not
> > sure if recapping that is useful at this point in time, let me know, I could
> > do that if it helps if folks want to talk about this at LSF.
> 
> I think there's so much to talk about WRT iomap that I also think we're
> not going to have time for recapping why buffer heads are awful.

Does this mean you'll finally be attending a conference (if invited)?!

I have some private write-ups on this, as I have reviewed *dozens* of
threads on prior efforts to address this issue, so I could try to
condense some of this into a preamble write-up before this session,
should this session be a candidate discussion.

> > For now I rather
> > instead focus on sharing efforts to review where we are today on the effort
> > towards conversion towards IOMAP for some of the major filesystems:
> > 
> > https://docs.google.com/presentation/d/e/2PACX-1vSN4TmhiTu1c6HNv6_gJZFqbFZpbF7GkABllSwJw5iLnSYKkkO-etQJ3AySYEbgJA/pub?start=true&loop=false&delayms=3000&slide=id.g189cfd05063_0_225
> 
> Ooh, slides, excellent!
> 
> Slide 3 -- there are a lot of separate struct iomap_ops in XFS because
> we learned (somewhat late) that creating a separate ops structure (and
> functions) for each IO path / use case is much more sanity-preserving
> than multiplexing everything into one gigantic function.

This is of huge value. It is interesting that this was found through
experience, i.e. empirically, rather than early on, and it made me
wonder: if things could be re-done, what would we do?

If that was useful for XFS -- could this also be useful for other
filesystems? Could implying this somehow in the iomap API be useful?

Based on your experience, *should* filesystem authors *strive* towards
splitting up the paths as well for iomap use?

> Now we have four different write codepaths in the kernel:
> 
> extern const struct iomap_ops xfs_buffered_write_iomap_ops;
> extern const struct iomap_ops xfs_page_mkwrite_iomap_ops;
> 
> These two use delayed allocation and use the pagecache for COW.
> 
> extern const struct iomap_ops xfs_direct_write_iomap_ops;
> extern const struct iomap_ops xfs_dax_write_iomap_ops;
> 
> These two don't use delayed allocation.  IIRC the DAX ops also take care
> of pre-zeroing allocations since we assume that memcpy to pmem is
> relatively "cheap".
> 
> Only one is needed to cover the read cases and FIEMAP:
> 
> extern const struct iomap_ops xfs_read_iomap_ops;
> 
> And then there's the weird ones for SEEK_{DATA,HOLE}:
> 
> extern const struct iomap_ops xfs_seek_iomap_ops;
> 
> And FIEMAP for xattrs:
> 
> extern const struct iomap_ops xfs_xattr_iomap_ops;
> 
> (I'll get to xfs_writeback_ops downthread.)
> 
> > I'm hoping this *might* be useful to some, but I fear it may leave quite
> > a bit of folks with more questions than answers as it did for me. And
> > hence I figured that *this aspect of this topic* perhaps might be a good
> > topic for LSF.  The end goal would hopefully then be finally enabling us
> > to document IOMAP API properly and helping with the whole conversion
> > effort.
> 
> Heh.  Yes.  Absolutely yes.
> 
> struct iomap_ops is a giant ball of frustration:
> 
> 	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
> 			unsigned flags, struct iomap *iomap,
> 			struct iomap *srcmap);
> 
> If you're a filesystem trying to use iomap, it's not at all clear when
> you're supposed to fill out @iomap and when you need to fill out
> @srcmap.  Right now, the rule is that an ->iomap_begin implementation
> should always fill out @iomap.  If the operation is a copy-on-write,
> only then does the ->iomap_begin function need to fill out @srcmap.
> 
> I've been mulling over redesigning this so that the two parameters are
> @read_iomap and @write_iomap.  Any time there's a place from which to
> read data, the ->iomap_begin function would fill out @read_iomap.  This
> would also be the vessel for SEEK_*, FIEMAP, and swapfile activation.
> 
> Any time there's a place in which to write data, the ->iomap_begin
> function would fill out @write_iomap.  For a COW, the read and write
> mappings would be different.  For a pure overwrite, they'd be the same.
> 
> This is going to take a bit of thought to get right.  It'll be easier
> for XFS because we split the iomap_ops to handle different
> responsibilities, but I haven't looked all that far into ext4 and btrfs
> to see what they do.

IMHO so long as those use cases don't break, we should be fine
re-thinking iomap now. Otherwise a change later would require much
more care and could regress. A few of us are on the brink of helping
with the conversion effort, but these questions seem a bit more
important to address than just trying to reduce the struct buffer_head
count in each filesystem.

> I also want to convert the inode/pos/length/flags arguments to a const
> pointer to struct iomap_iter to reduce the register count.

Neat.

> > My gatherings from this quick review of API evolution and use is that,
> > XFS is *certainly* a first class citizen user. No surprise there if a
> > lot of the effort came out from XFS. And even though btrfs now avoids
> > the evil struct buffer_head monster, its use of the IOMAP API seems
> > *dramatically* different than XFS, and it probably puzzles many. Is it
> > that btrfs managed to just get rid of struct buffer_head use but missed
> > fully abstracting working with the page cache? How does one check? What
> > semantics do we look for?
> 
> I'm under the impression that for buffered io, btrfs manages pagecache
> folios directly, without the use of iomap at all.  Hence no mention of
> buffer heads in the codebase.
> 
> A big problem I see in porting btrfs/ext*/f2fs/etc to use the iomap
> pagecache code is that iomap doesn't support io permutations at all.
> It assumes that it can assemble struct bio with pagecache pages and
> issue that bio directly to the device.
> 
> IOWs, someone will need to port fscrypt, fsverity, compression, etc. to
> iomap before those filesystems can jump off bufferheads.

Is it a worthy endeavor to have filesystems use iomap to manage working
with the pagecache for them? Even if hard, is that a worthy, lofty goal
to aim for? Do we have general consensus on this?

> > When looking to see if one can help on the conversion front with other
> > filesystems it begs the question what is the correct real end goal. What
> > should one strive for? And it also gets me wondering, if we wanted to abstract
> > the page cache from scratch again, would we have done this a bit differently
> > now? Are there lessons from the network filesystem side of things which
> > can be shared? If so it gets me wondering if this instead should be
> > about why that's a good idea and what should that look like.
> 
> All good questions to ask.
> 
> > Perhaps fs/buffers.c could be converted to folios only, and be done
> > with it. But would we be loosing out on something? What would that be?
> 
> Dirty secret here: I still don't understand what buffers.c does. ;)

I'm in the same boat :D

  Luis



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-02-27 19:47   ` Darrick J. Wong
@ 2023-02-27 20:24     ` Luis Chamberlain
  0 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2023-02-27 20:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, lsf-pc, Christoph Hellwig, David Howells,
	kbus >> Keith Busch, Pankaj Raghav, linux-fsdevel,
	linux-mm

On Mon, Feb 27, 2023 at 11:47:24AM -0800, Darrick J. Wong wrote:
> OTOH, it also means that we've learned the hard way that pagecache
> operations need a means to revalidate mappings to avoid write races.
> This applies both to the initial pagecache write and to scheduling
> writeback, but the mechanisms for each were developed separately and
> years apart.  See iomap::validity_cookie and
> xfs_writepage_ctx::{data,cow}_seq for what I'm talking about.
> We (xfs developers) ought to figure out if these two mechanisms should
> be merged before more filesystems start using iomap for buffered io.

That puts a good yield notice on some conversion efforts; thanks, this
alone is already very useful.

> I'd like to have a discussion about how to clean up and clarify the
> iomap interfaces, and a separate one about how to port the remaining 35+
> filesystems.  I don't know how exactly to split this into LSF sessions,
> other than to suggest at least two.

From a conversion perspective, ideally, if it were obvious, I think we
should be able to do some of it à la Coccinelle, but I have yet to see
any remotely obvious pattern. And it makes me wonder: should we strive
to make the conversion as painless / obvious as possible? Is that a good
litmus test for when we should be ready to start converting other
filesystems?

> If hch or dchinner show up, I also want to drag them into this. :)

And here I've been thinking I had to go to Australia to see you all together.

  Luis



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-02-27 19:26     ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong
@ 2023-02-27 21:02       ` Matthew Wilcox
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2023-02-27 21:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Luis Chamberlain, lsf-pc, Christoph Hellwig,
	David Howells, kbus @imap.suse.de>> Keith Busch,
	Pankaj Raghav, linux-fsdevel, linux-mm

On Mon, Feb 27, 2023 at 11:26:17AM -0800, Darrick J. Wong wrote:
> > As I wrote above, for metadata there ought to be something as otherwise it
> > will be real pain (and no gain really). But I guess the concrete API only
> > matterializes once we attempt a conversion of some filesystem like ext2.
> > I'll try to have a look into that, at least the obvious preparatory steps
> > like converting the data paths to iomap.
> 
> willy's tried this.

Well, I started on it.  The first thing I looked at was trying to decide
how to handle the equivalent of BH_boundary:

https://lore.kernel.org/linux-fsdevel/20200320144014.3276-1-willy@infradead.org/

I'm no longer sure that's the right approach.  Maybe we do want to have
an IOMAP_F_BOUNDARY to indicate that the bio should be submitted before
calling iomap_iter() again.  But we're definitely still missing the piece
where we start the I/O to the page by setting iop->read_bytes_pending
to folio_size() and then subtract off each piece as we either zero it
or the I/O completes.
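
i.e. conceptually something like this (pseudo-C, glossing over locking and
the exact field types):

	/* when starting I/O on the folio: the whole folio is outstanding */
	atomic_set(&iop->read_bytes_pending, folio_size(folio));

	/* as each piece is either zeroed or its read completes, subtract its
	 * length; whoever brings the count to zero marks the folio uptodate */
	if (atomic_sub_and_test(len, &iop->read_bytes_pending))
		folio_mark_uptodate(folio);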

I have a bunch of other pagecache / fs work queued up for this cycle, so
I wasn't planning on leaping back into this.  But it's worth making sure
that people know about one of the problems we figured out three years ago:
https://lore.kernel.org/linux-fsdevel/20200505190825.GB5694@magnolia/



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-01-29  4:46 LSF/MM/BPF 2023 IOMAP conversion status update Luis Chamberlain
  2023-01-29  5:06 ` Matthew Wilcox
  2023-02-27 19:06 ` Darrick J. Wong
@ 2023-03-01 16:59 ` Ritesh Harjani
  2023-03-01 17:08   ` Darrick J. Wong
  2 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani @ 2023-03-01 16:59 UTC (permalink / raw)
  To: Luis Chamberlain, lsf-pc, Christoph Hellwig, Matthew Wilcox,
	David Howells
  Cc: Luis Chamberlain, kbus >> Keith Busch, Pankaj Raghav,
	linux-fsdevel, linux-mm

Luis Chamberlain <mcgrof@kernel.org> writes:

> One of the recurring themes that comes up at LSF is "iomap has little
> to no documentation, it is hard to use". I've only recently taken a
> little nose dive into it, and so I also can frankly admit to say I don't
> grok it well either yet. However, the *general* motivation and value is clear:
> avoiding the old ugly monster of struct buffer_head, and abstracting
> the page cache for non network filesystems, and that is because for
> network filesystems my understanding is that we have another side effort
> for that. We could go a bit down memory lane on prior attempts to kill
> the struct buffer_head evil demon from Linux, or why its evil, but I'm not
> sure if recapping that is useful at this point in time, let me know, I could
> do that if it helps if folks want to talk about this at LSF. For now I rather

It would certainly help to hear what our plans are for the
IOMAP_F_BUFFER_HEAD flag and its related code. I know it is there
for gfs2, but it would be good to know what the plans are before we
start converting all the other filesystems to iomap.
Do we advise other filesystems not to use this path? Do we plan
to deprecate it in order to kill buffer heads in the future?
e.g.
root> git grep "buffer_head" fs/iomap/
fs/iomap/buffered-io.c:#include <linux/buffer_head.h>

I wanted more insight on this, and on our plans w.r.t. other filesystems
wanting to use it. So a short walk down memory lane, and our plans
for the future w.r.t. IOMAP_F_BUFFER_HEAD, would certainly help.

Thanks
-ritesh



* Re: LSF/MM/BPF 2023 IOMAP conversion status update
  2023-03-01 16:59 ` Ritesh Harjani
@ 2023-03-01 17:08   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2023-03-01 17:08 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Luis Chamberlain, lsf-pc, Christoph Hellwig, Matthew Wilcox,
	David Howells, kbus >> Keith Busch, Pankaj Raghav,
	linux-fsdevel, linux-mm

On Wed, Mar 01, 2023 at 10:29:56PM +0530, Ritesh Harjani wrote:
> Luis Chamberlain <mcgrof@kernel.org> writes:
> 
> > One of the recurring themes that comes up at LSF is "iomap has little
> > to no documentation, it is hard to use". I've only recently taken a
> > little nose dive into it, and so I also can frankly admit to say I don't
> > grok it well either yet. However, the *general* motivation and value is clear:
> > avoiding the old ugly monster of struct buffer_head, and abstracting
> > the page cache for non network filesystems, and that is because for
> > network filesystems my understanding is that we have another side effort
> > for that. We could go a bit down memory lane on prior attempts to kill
> > the struct buffer_head evil demon from Linux, or why its evil, but I'm not
> > sure if recapping that is useful at this point in time, let me know, I could
> > do that if it helps if folks want to talk about this at LSF. For now I rather
> 
> It would certainly help to hear on what are our plans of
> IOMAP_F_BUFFER_HEAD flag and it's related code. I know it is there
> for gfs2, but it would be good to know on what are our plans before we
> start converting all other filesystems to move to iomap?
> Do we advise on not to use this path for other filesystems? Do we plan
> to deprecate it in order to kill buffer heads in future?
> e.g.
> root> git grep "buffer_head" fs/iomap/
> fs/iomap/buffered-io.c:#include <linux/buffer_head.h>
> 
> Wanted more insights on this and our plans w.r.t other filesystem
> wanting to use it. So a short walk down the memory lane and our plans
> for future w.r.t IOMAP_F_BUFFER_HEAD would certainly help.

For filesystems considering an iomap port, my advice is:

If your filesystem is simple (e.g. direct overwrites, no cow, no verity,
no fscrypt, etc) then you ought to consider jumping to iomap directly
and moving off bufferheads forever.  If I were working on a port of
something simple(ish) like ext2 or fat or something, that's how I'd go.

Obviously, any filesystem that does not use bufferheads and ports to
iomap should go straight there.  Do not *start* using bufferheads.

For filesystems where things are more complicated (ext4/jbd2) it might
make more sense to port to iomap with bufferheads in one release to make
sure you've gotten the locking, iomap-ops, and writeback working
correctly.  Once that's done, then move off bufferheads.

gfs2 might want to move off of bufferheads some day, but I think they're
still letting the dust settle on the new iomap plumbing.

IOWs, I don't see IOMAP_F_BUFFER_HEAD going away until there's no longer
any interest in it.
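
For illustration, a minimal sketch of that transitional opt-in: a
filesystem's ->iomap_begin can keep individual mappings on the
buffer_head based buffered I/O path by setting IOMAP_F_BUFFER_HEAD, as
gfs2 does today. The myfs_* name is hypothetical and the block mapping
is elided; this is not code from any existing filesystem.

static int myfs_iomap_begin(struct inode *inode, loff_t offset,
			    loff_t length, unsigned int flags,
			    struct iomap *iomap, struct iomap *srcmap)
{
	/*
	 * Look up (or allocate) blocks and fill in iomap->type,
	 * iomap->addr, iomap->offset and iomap->length here.
	 */

	/*
	 * Transitional only: tell fs/iomap/buffered-io.c to keep using
	 * buffer heads for this mapping until the filesystem's own
	 * writeback and journalling no longer depend on them.
	 */
	iomap->flags |= IOMAP_F_BUFFER_HEAD;
	return 0;
}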

--D

> Thanks
> -ritesh


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-02-26 20:16     ` Ritesh Harjani
@ 2023-03-16 14:40       ` Ritesh Harjani (IBM)
  2023-03-16 15:41         ` Darrick J. Wong
                           ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-03-16 14:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, lsf-pc, Ritesh Harjani (IBM)

[DO NOT MERGE] [WORK-IN-PROGRESS]

Hello Jan,

This is an initial version of the patch set which I wanted to share
before today's call. This is still work in progress but at least passes
the set of test cases which I had kept for dio testing (except 1 from my
list).

Looks like there won't be much/any changes required from iomap side to
support ext2 moving to iomap apis.

I will be doing some more testing, specifically generic/083, which is
occasionally failing in my testing.
Also, once this is stabilized, I can do some performance testing too if you
feel it is needed. Last I remember, we saw some performance regressions when ext4
moved to iomap for dio.

PS: Please ignore if there are some silly mistakes. As I said, I wanted
to get this out before today's discussion. :)

Thanks for your help!!

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext2/ext2.h  |   1 +
 fs/ext2/file.c  | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c |  20 +--------
 3 files changed, 117 insertions(+), 18 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index cb78d7dcfb95..cb5e309fe040 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
 extern struct inode *ext2_iget (struct super_block *, unsigned long);
 extern int ext2_write_inode (struct inode *, struct writeback_control *);
 extern void ext2_evict_inode(struct inode *);
+extern void ext2_write_failed(struct address_space *mapping, loff_t to);
 extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
 extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *);
 extern int ext2_getattr (struct mnt_idmap *, const struct path *,
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 6b4bebe982ca..7a8561304559 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	return ret;
 }

+static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file->f_mapping->host;
+	ssize_t ret;
+
+	inode_lock_shared(inode);
+	ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
+	inode_unlock_shared(inode);
+
+	return ret;
+}
+
+static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
+				 int error, unsigned int flags)
+{
+	loff_t pos = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (error)
+		return error;
+
+	pos += size;
+	if (pos > i_size_read(inode))
+		i_size_write(inode, pos);
+
+	return 0;
+}
+
+static const struct iomap_dio_ops ext2_dio_write_ops = {
+	.end_io = ext2_dio_write_end_io,
+};
+
+static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file->f_mapping->host;
+	ssize_t ret;
+	unsigned int flags;
+	unsigned long blocksize = inode->i_sb->s_blocksize;
+	loff_t offset = iocb->ki_pos;
+	loff_t count = iov_iter_count(from);
+
+
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out_unlock;
+	ret = file_remove_privs(file);
+	if (ret)
+		goto out_unlock;
+	ret = file_update_time(file);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
+	 * calls for generic_write_sync in iomap_dio_complete().
+	 * Since ext2_fsync nmust be called w/o inode lock,
+	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
+	 * ourselves.
+	 */
+	flags = IOMAP_DIO_NOSYNC;
+
+	/* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */
+	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
+	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
+		flags |= IOMAP_DIO_FORCE_WAIT;
+
+	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
+			   flags, NULL, 0);
+
+	if (ret == -ENOTBLK)
+		ret = 0;
+
+	if (ret < 0 && ret != -EIOCBQUEUED)
+		ext2_write_failed(inode->i_mapping, offset + count);
+
+	/* handle case for partial write or fallback to buffered write */
+	if (ret >= 0 && iov_iter_count(from)) {
+		loff_t pos, endbyte;
+		ssize_t status;
+		ssize_t ret2;
+
+		pos = iocb->ki_pos;
+		status = generic_perform_write(iocb, from);
+		if (unlikely(status < 0)) {
+			ret = status;
+			goto out_unlock;
+		}
+		endbyte = pos + status - 1;
+		ret2 = filemap_write_and_wait_range(inode->i_mapping, pos,
+						    endbyte);
+		if (ret2 == 0) {
+			iocb->ki_pos = endbyte + 1;
+			ret += status;
+			invalidate_mapping_pages(inode->i_mapping,
+						 pos >> PAGE_SHIFT,
+						 endbyte >> PAGE_SHIFT);
+		}
+	}
+out_unlock:
+	inode_unlock(inode);
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	return ret;
+}
+
 static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 #ifdef CONFIG_FS_DAX
 	if (IS_DAX(iocb->ki_filp->f_mapping->host))
 		return ext2_dax_read_iter(iocb, to);
 #endif
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return ext2_dio_read_iter(iocb, to);
+
 	return generic_file_read_iter(iocb, to);
 }

@@ -176,6 +287,9 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (IS_DAX(iocb->ki_filp->f_mapping->host))
 		return ext2_dax_write_iter(iocb, from);
 #endif
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return ext2_dio_write_iter(iocb, from);
+
 	return generic_file_write_iter(iocb, from);
 }

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 26f135e7ffce..7ff669d0b6d2 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -56,7 +56,7 @@ static inline int ext2_inode_is_fast_symlink(struct inode *inode)

 static void ext2_truncate_blocks(struct inode *inode, loff_t offset);

-static void ext2_write_failed(struct address_space *mapping, loff_t to)
+void ext2_write_failed(struct address_space *mapping, loff_t to)
 {
 	struct inode *inode = mapping->host;

@@ -908,22 +908,6 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
 	return generic_block_bmap(mapping,block,ext2_get_block);
 }

-static ssize_t
-ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
-	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	size_t count = iov_iter_count(iter);
-	loff_t offset = iocb->ki_pos;
-	ssize_t ret;
-
-	ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
-		ext2_write_failed(mapping, offset + count);
-	return ret;
-}
-
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
@@ -946,7 +930,7 @@ const struct address_space_operations ext2_aops = {
 	.write_begin		= ext2_write_begin,
 	.write_end		= ext2_write_end,
 	.bmap			= ext2_bmap,
-	.direct_IO		= ext2_direct_IO,
+	.direct_IO		= noop_direct_IO,
 	.writepages		= ext2_writepages,
 	.migrate_folio		= buffer_migrate_folio,
 	.is_partially_uptodate	= block_is_partially_uptodate,
--
2.39.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-03-16 14:40       ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM)
@ 2023-03-16 15:41         ` Darrick J. Wong
  2023-03-20 16:11           ` Ritesh Harjani
  2023-03-20 13:15         ` Christoph Hellwig
  2023-03-20 17:51         ` Jan Kara
  2 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-03-16 15:41 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc

On Thu, Mar 16, 2023 at 08:10:29PM +0530, Ritesh Harjani (IBM) wrote:
> [DO NOT MERGE] [WORK-IN-PROGRESS]
> 
> Hello Jan,
> 
> This is an initial version of the patch set which I wanted to share
> before today's call. This is still work in progress but atleast passes
> the set of test cases which I had kept for dio testing (except 1 from my
> list).
> 
> Looks like there won't be much/any changes required from iomap side to
> support ext2 moving to iomap apis.
> 
> I will be doing some more testing specifically test generic/083 which is
> occassionally failing in my testing.
> Also once this is stabilized, I can do some performance testing too if you
> feel so. Last I remembered we saw some performance regressions when ext4
> moved to iomap for dio.
> 
> PS: Please ignore if there are some silly mistakes. As I said, I wanted
> to get this out before today's discussion. :)
> 
> Thanks for your help!!
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext2/ext2.h  |   1 +
>  fs/ext2/file.c  | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext2/inode.c |  20 +--------
>  3 files changed, 117 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index cb78d7dcfb95..cb5e309fe040 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
>  extern struct inode *ext2_iget (struct super_block *, unsigned long);
>  extern int ext2_write_inode (struct inode *, struct writeback_control *);
>  extern void ext2_evict_inode(struct inode *);
> +extern void ext2_write_failed(struct address_space *mapping, loff_t to);
>  extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
>  extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *);
>  extern int ext2_getattr (struct mnt_idmap *, const struct path *,
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 6b4bebe982ca..7a8561304559 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  	return ret;
>  }
> 
> +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct inode *inode = file->f_mapping->host;
> +	ssize_t ret;
> +
> +	inode_lock_shared(inode);
> +	ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
> +	inode_unlock_shared(inode);
> +
> +	return ret;
> +}
> +
> +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> +				 int error, unsigned int flags)
> +{
> +	loff_t pos = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error)
> +		return error;
> +
> +	pos += size;
> +	if (pos > i_size_read(inode))
> +		i_size_write(inode, pos);
> +
> +	return 0;
> +}
> +
> +static const struct iomap_dio_ops ext2_dio_write_ops = {
> +	.end_io = ext2_dio_write_end_io,
> +};
> +
> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct inode *inode = file->f_mapping->host;
> +	ssize_t ret;
> +	unsigned int flags;
> +	unsigned long blocksize = inode->i_sb->s_blocksize;
> +	loff_t offset = iocb->ki_pos;
> +	loff_t count = iov_iter_count(from);
> +
> +
> +	inode_lock(inode);
> +	ret = generic_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out_unlock;
> +	ret = file_remove_privs(file);
> +	if (ret)
> +		goto out_unlock;
> +	ret = file_update_time(file);
> +	if (ret)
> +		goto out_unlock;

kiocb_modified() instead of calling file_remove_privs?

> +
> +	/*
> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
> +	 * calls for generic_write_sync in iomap_dio_complete().
> +	 * Since ext2_fsync nmust be called w/o inode lock,
> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
> +	 * ourselves.
> +	 */
> +	flags = IOMAP_DIO_NOSYNC;
> +
> +	/* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */
> +	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
> +	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
> +		flags |= IOMAP_DIO_FORCE_WAIT;
> +
> +	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
> +			   flags, NULL, 0);
> +
> +	if (ret == -ENOTBLK)
> +		ret = 0;
> +
> +	if (ret < 0 && ret != -EIOCBQUEUED)
> +		ext2_write_failed(inode->i_mapping, offset + count);
> +
> +	/* handle case for partial write or fallback to buffered write */
> +	if (ret >= 0 && iov_iter_count(from)) {
> +		loff_t pos, endbyte;
> +		ssize_t status;
> +		ssize_t ret2;
> +
> +		pos = iocb->ki_pos;
> +		status = generic_perform_write(iocb, from);
> +		if (unlikely(status < 0)) {
> +			ret = status;
> +			goto out_unlock;
> +		}
> +		endbyte = pos + status - 1;
> +		ret2 = filemap_write_and_wait_range(inode->i_mapping, pos,
> +						    endbyte);
> +		if (ret2 == 0) {
> +			iocb->ki_pos = endbyte + 1;
> +			ret += status;
> +			invalidate_mapping_pages(inode->i_mapping,
> +						 pos >> PAGE_SHIFT,
> +						 endbyte >> PAGE_SHIFT);
> +		}
> +	}

(Why not fall back to the actual buffered write path?)

Otherwise this looks like a reasonable first start.

--D

> +out_unlock:
> +	inode_unlock(inode);
> +	if (ret > 0)
> +		ret = generic_write_sync(iocb, ret);
> +	return ret;
> +}
> +
>  static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
>  #ifdef CONFIG_FS_DAX
>  	if (IS_DAX(iocb->ki_filp->f_mapping->host))
>  		return ext2_dax_read_iter(iocb, to);
>  #endif
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		return ext2_dio_read_iter(iocb, to);
> +
>  	return generic_file_read_iter(iocb, to);
>  }
> 
> @@ -176,6 +287,9 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (IS_DAX(iocb->ki_filp->f_mapping->host))
>  		return ext2_dax_write_iter(iocb, from);
>  #endif
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		return ext2_dio_write_iter(iocb, from);
> +
>  	return generic_file_write_iter(iocb, from);
>  }
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 26f135e7ffce..7ff669d0b6d2 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -56,7 +56,7 @@ static inline int ext2_inode_is_fast_symlink(struct inode *inode)
> 
>  static void ext2_truncate_blocks(struct inode *inode, loff_t offset);
> 
> -static void ext2_write_failed(struct address_space *mapping, loff_t to)
> +void ext2_write_failed(struct address_space *mapping, loff_t to)
>  {
>  	struct inode *inode = mapping->host;
> 
> @@ -908,22 +908,6 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
>  	return generic_block_bmap(mapping,block,ext2_get_block);
>  }
> 
> -static ssize_t
> -ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> -{
> -	struct file *file = iocb->ki_filp;
> -	struct address_space *mapping = file->f_mapping;
> -	struct inode *inode = mapping->host;
> -	size_t count = iov_iter_count(iter);
> -	loff_t offset = iocb->ki_pos;
> -	ssize_t ret;
> -
> -	ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
> -	if (ret < 0 && iov_iter_rw(iter) == WRITE)
> -		ext2_write_failed(mapping, offset + count);
> -	return ret;
> -}
> -
>  static int
>  ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  {
> @@ -946,7 +930,7 @@ const struct address_space_operations ext2_aops = {
>  	.write_begin		= ext2_write_begin,
>  	.write_end		= ext2_write_end,
>  	.bmap			= ext2_bmap,
> -	.direct_IO		= ext2_direct_IO,
> +	.direct_IO		= noop_direct_IO,
>  	.writepages		= ext2_writepages,
>  	.migrate_folio		= buffer_migrate_folio,
>  	.is_partially_uptodate	= block_is_partially_uptodate,
> --
> 2.39.2
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-03-16 14:40       ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM)
  2023-03-16 15:41         ` Darrick J. Wong
@ 2023-03-20 13:15         ` Christoph Hellwig
  2023-03-20 17:51         ` Jan Kara
  2 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2023-03-20 13:15 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc

On Thu, Mar 16, 2023 at 08:10:29PM +0530, Ritesh Harjani (IBM) wrote:
> +extern void ext2_write_failed(struct address_space *mapping, loff_t to);

Nit: please don't bother with externs for function prototypes.

> +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> +				 int error, unsigned int flags)
> +{
> +	loff_t pos = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error)
> +		return error;
> +
> +	pos += size;
> +	if (pos > i_size_read(inode))
> +		i_size_write(inode, pos);

Doesn't i_size_write need i_mutex protection?

> +	/*
> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
> +	 * calls for generic_write_sync in iomap_dio_complete().
> +	 * Since ext2_fsync nmust be called w/o inode lock,
> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
> +	 * ourselves.
> +	 */
> +	flags = IOMAP_DIO_NOSYNC;

So we added IOMAP_DIO_NOSYNC for btrfs initially, but even btrfs did
not manage to make it work.  I suspect the right thing here as well
is to call __iomap_dio_rw and then handle the completion and sync
separately after dropping the lock.  Without that you can't, for
example, take i_mutex in the end_io handler, which I think we need to do.

We should then remove IOMAP_DIO_NOSYNC again.
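
For illustration, a rough sketch of that split, reusing the ext2_iomap_ops /
ext2_dio_write_ops names from the patch and assuming the current
__iomap_dio_rw() / iomap_dio_complete() calling convention; the only point
here is that the unlock happens before the completion and sync:

	struct iomap_dio *dio;

	inode_lock(inode);
	/* ... generic_write_checks(), privilege/timestamp updates, ... */
	dio = __iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
			     0, NULL, 0);
	inode_unlock(inode);

	if (IS_ERR_OR_NULL(dio))
		return PTR_ERR_OR_ZERO(dio);
	/*
	 * iomap_dio_complete() invokes ->end_io and, for writes that need
	 * it, generic_write_sync(), but now without the inode lock held,
	 * so IOMAP_DIO_NOSYNC is no longer needed and the end_io handler
	 * is free to take the inode lock itself.
	 */
	return iomap_dio_complete(dio);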


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-03-16 15:41         ` Darrick J. Wong
@ 2023-03-20 16:11           ` Ritesh Harjani
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani @ 2023-03-20 16:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Thu, Mar 16, 2023 at 08:10:29PM +0530, Ritesh Harjani (IBM) wrote:
>> [DO NOT MERGE] [WORK-IN-PROGRESS]
>>
>> Hello Jan,
>>
>> This is an initial version of the patch set which I wanted to share
>> before today's call. This is still work in progress but atleast passes
>> the set of test cases which I had kept for dio testing (except 1 from my
>> list).
>>
>> Looks like there won't be much/any changes required from iomap side to
>> support ext2 moving to iomap apis.
>>
>> I will be doing some more testing specifically test generic/083 which is
>> occassionally failing in my testing.
>> Also once this is stabilized, I can do some performance testing too if you
>> feel so. Last I remembered we saw some performance regressions when ext4
>> moved to iomap for dio.
>>
>> PS: Please ignore if there are some silly mistakes. As I said, I wanted
>> to get this out before today's discussion. :)
>>
>> Thanks for your help!!
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/ext2/ext2.h  |   1 +
>>  fs/ext2/file.c  | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/ext2/inode.c |  20 +--------
>>  3 files changed, 117 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
>> index cb78d7dcfb95..cb5e309fe040 100644
>> --- a/fs/ext2/ext2.h
>> +++ b/fs/ext2/ext2.h
>> @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
>>  extern struct inode *ext2_iget (struct super_block *, unsigned long);
>>  extern int ext2_write_inode (struct inode *, struct writeback_control *);
>>  extern void ext2_evict_inode(struct inode *);
>> +extern void ext2_write_failed(struct address_space *mapping, loff_t to);
>>  extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
>>  extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *);
>>  extern int ext2_getattr (struct mnt_idmap *, const struct path *,
>> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
>> index 6b4bebe982ca..7a8561304559 100644
>> --- a/fs/ext2/file.c
>> +++ b/fs/ext2/file.c
>> @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>>  	return ret;
>>  }
>>
>> +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
>> +{
>> +	struct file *file = iocb->ki_filp;
>> +	struct inode *inode = file->f_mapping->host;
>> +	ssize_t ret;
>> +
>> +	inode_lock_shared(inode);
>> +	ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
>> +	inode_unlock_shared(inode);
>> +
>> +	return ret;
>> +}
>> +
>> +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>> +				 int error, unsigned int flags)
>> +{
>> +	loff_t pos = iocb->ki_pos;
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> +	if (error)
>> +		return error;
>> +
>> +	pos += size;
>> +	if (pos > i_size_read(inode))
>> +		i_size_write(inode, pos);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct iomap_dio_ops ext2_dio_write_ops = {
>> +	.end_io = ext2_dio_write_end_io,
>> +};
>> +
>> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> +{
>> +	struct file *file = iocb->ki_filp;
>> +	struct inode *inode = file->f_mapping->host;
>> +	ssize_t ret;
>> +	unsigned int flags;
>> +	unsigned long blocksize = inode->i_sb->s_blocksize;
>> +	loff_t offset = iocb->ki_pos;
>> +	loff_t count = iov_iter_count(from);
>> +
>> +
>> +	inode_lock(inode);
>> +	ret = generic_write_checks(iocb, from);
>> +	if (ret <= 0)
>> +		goto out_unlock;
>> +	ret = file_remove_privs(file);
>> +	if (ret)
>> +		goto out_unlock;
>> +	ret = file_update_time(file);
>> +	if (ret)
>> +		goto out_unlock;
>
> kiocb_modified() instead of calling file_remove_privs?

Yes, it looks like it is a replacement for file_remove_privs() and
file_update_time().
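
For reference, a minimal sketch of how the prologue might then look,
assuming kiocb_modified() keeps its current behaviour of combining the
privilege stripping and the timestamp update for a kiocb:

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret <= 0)
		goto out_unlock;
	/* replaces the separate file_remove_privs() + file_update_time() calls */
	ret = kiocb_modified(iocb);
	if (ret)
		goto out_unlock;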


>> +
>> +	/*
>> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
>> +	 * calls for generic_write_sync in iomap_dio_complete().
>> +	 * Since ext2_fsync nmust be called w/o inode lock,
>> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
>> +	 * ourselves.
>> +	 */
>> +	flags = IOMAP_DIO_NOSYNC;
>> +
>> +	/* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */
>> +	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
>> +	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
>> +		flags |= IOMAP_DIO_FORCE_WAIT;
>> +
>> +	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
>> +			   flags, NULL, 0);
>> +
>> +	if (ret == -ENOTBLK)
>> +		ret = 0;
>> +
>> +	if (ret < 0 && ret != -EIOCBQUEUED)
>> +		ext2_write_failed(inode->i_mapping, offset + count);
>> +
>> +	/* handle case for partial write or fallback to buffered write */
>> +	if (ret >= 0 && iov_iter_count(from)) {
>> +		loff_t pos, endbyte;
>> +		ssize_t status;
>> +		ssize_t ret2;
>> +
>> +		pos = iocb->ki_pos;
>> +		status = generic_perform_write(iocb, from);
>> +		if (unlikely(status < 0)) {
>> +			ret = status;
>> +			goto out_unlock;
>> +		}
>> +		endbyte = pos + status - 1;
>> +		ret2 = filemap_write_and_wait_range(inode->i_mapping, pos,
>> +						    endbyte);
>> +		if (ret2 == 0) {
>> +			iocb->ki_pos = endbyte + 1;
>> +			ret += status;
>> +			invalidate_mapping_pages(inode->i_mapping,
>> +						 pos >> PAGE_SHIFT,
>> +						 endbyte >> PAGE_SHIFT);
>> +		}
>> +	}
>
> (Why not fall back to the actual buffered write path?)
>

Because then we can handle everything related to DIO in the DIO write
path (ext2_dio_write_iter()) itself. For example, as per the semantics of
DIO we should ensure that page-cache pages are written to disk and
invalidated before returning (hence the filemap_write_and_wait_range() and
invalidate_mapping_pages() calls).

> Otherwise this looks like a reasonable first start.

Thanks!

-ritesh


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-03-16 14:40       ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM)
  2023-03-16 15:41         ` Darrick J. Wong
  2023-03-20 13:15         ` Christoph Hellwig
@ 2023-03-20 17:51         ` Jan Kara
  2023-03-22  6:34           ` Ritesh Harjani
  2 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2023-03-20 17:51 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc

On Thu 16-03-23 20:10:29, Ritesh Harjani (IBM) wrote:
> [DO NOT MERGE] [WORK-IN-PROGRESS]
> 
> Hello Jan,
> 
> This is an initial version of the patch set which I wanted to share
> before today's call. This is still work in progress but atleast passes
> the set of test cases which I had kept for dio testing (except 1 from my
> list).
> 
> Looks like there won't be much/any changes required from iomap side to
> support ext2 moving to iomap apis.
> 
> I will be doing some more testing specifically test generic/083 which is
> occassionally failing in my testing.
> Also once this is stabilized, I can do some performance testing too if you
> feel so. Last I remembered we saw some performance regressions when ext4
> moved to iomap for dio.
> 
> PS: Please ignore if there are some silly mistakes. As I said, I wanted
> to get this out before today's discussion. :)
> 
> Thanks for your help!!
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext2/ext2.h  |   1 +
>  fs/ext2/file.c  | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext2/inode.c |  20 +--------
>  3 files changed, 117 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index cb78d7dcfb95..cb5e309fe040 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
>  extern struct inode *ext2_iget (struct super_block *, unsigned long);
>  extern int ext2_write_inode (struct inode *, struct writeback_control *);
>  extern void ext2_evict_inode(struct inode *);
> +extern void ext2_write_failed(struct address_space *mapping, loff_t to);
>  extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
>  extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *);
>  extern int ext2_getattr (struct mnt_idmap *, const struct path *,
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 6b4bebe982ca..7a8561304559 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  	return ret;
>  }
> 
> +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct inode *inode = file->f_mapping->host;
> +	ssize_t ret;
> +
> +	inode_lock_shared(inode);
> +	ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
> +	inode_unlock_shared(inode);
> +
> +	return ret;
> +}
> +
> +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> +				 int error, unsigned int flags)
> +{
> +	loff_t pos = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error)
> +		return error;
> +

I guess you should carry over here relevant bits of the comment from
ext4_dio_write_end_io() explaining that doing i_size update here is
necessary and actually safe.

> +	pos += size;
> +	if (pos > i_size_read(inode))
> +		i_size_write(inode, pos);
> +
> +	return 0;
> +}
> +
> +static const struct iomap_dio_ops ext2_dio_write_ops = {
> +	.end_io = ext2_dio_write_end_io,
> +};
> +
> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct inode *inode = file->f_mapping->host;
> +	ssize_t ret;
> +	unsigned int flags;
> +	unsigned long blocksize = inode->i_sb->s_blocksize;
> +	loff_t offset = iocb->ki_pos;
> +	loff_t count = iov_iter_count(from);
> +
> +
> +	inode_lock(inode);
> +	ret = generic_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out_unlock;
> +	ret = file_remove_privs(file);
> +	if (ret)
> +		goto out_unlock;
> +	ret = file_update_time(file);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/*
> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
> +	 * calls for generic_write_sync in iomap_dio_complete().
> +	 * Since ext2_fsync nmust be called w/o inode lock,
> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
> +	 * ourselves.
> +	 */
> +	flags = IOMAP_DIO_NOSYNC;

Meh, this is kind of ugly and we should come up with something better for
simple filesystems so that they don't have to play these games. Frankly,
these days I doubt there's anybody really needing inode_lock in
__generic_file_fsync(). Neither sync_mapping_buffers() nor
sync_inode_metadata() need inode_lock for their self-consistency. So it is
only about flushing more consistent set of metadata to disk when fsync(2)
races with other write(2)s to the same file so after a crash we have higher
chances of seeing some real state of the file. But I'm not sure it's really
worth keeping for filesystems that are still using sync_mapping_buffers().
People that care about consistency after a crash have IMHO moved to other
filesystems long ago.

> +
> +	/* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */
						  ^^ or

> +	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
> +	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
> +		flags |= IOMAP_DIO_FORCE_WAIT;
> +
> +	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
> +			   flags, NULL, 0);
> +
> +	if (ret == -ENOTBLK)
> +		ret = 0;

So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of
blockdev_direct_IO(). Thus you have to implement that in your
ext2_iomap_ops, in particular in iomap_begin...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-03-20 17:51         ` Jan Kara
@ 2023-03-22  6:34           ` Ritesh Harjani
  2023-03-23 11:30             ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani @ 2023-03-22  6:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc


Thanks Jan for your feedback!

Jan Kara <jack@suse.cz> writes:

> On Thu 16-03-23 20:10:29, Ritesh Harjani (IBM) wrote:
>> [DO NOT MERGE] [WORK-IN-PROGRESS]
>>
>> Hello Jan,
>>
>> This is an initial version of the patch set which I wanted to share
>> before today's call. This is still work in progress but atleast passes
>> the set of test cases which I had kept for dio testing (except 1 from my
>> list).
>>
>> Looks like there won't be much/any changes required from iomap side to
>> support ext2 moving to iomap apis.
>>
>> I will be doing some more testing specifically test generic/083 which is
>> occassionally failing in my testing.
>> Also once this is stabilized, I can do some performance testing too if you
>> feel so. Last I remembered we saw some performance regressions when ext4
>> moved to iomap for dio.
>>
>> PS: Please ignore if there are some silly mistakes. As I said, I wanted
>> to get this out before today's discussion. :)
>>
>> Thanks for your help!!
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/ext2/ext2.h  |   1 +
>>  fs/ext2/file.c  | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/ext2/inode.c |  20 +--------
>>  3 files changed, 117 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
>> index cb78d7dcfb95..cb5e309fe040 100644
>> --- a/fs/ext2/ext2.h
>> +++ b/fs/ext2/ext2.h
>> @@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
>>  extern struct inode *ext2_iget (struct super_block *, unsigned long);
>>  extern int ext2_write_inode (struct inode *, struct writeback_control *);
>>  extern void ext2_evict_inode(struct inode *);
>> +extern void ext2_write_failed(struct address_space *mapping, loff_t to);
>>  extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
>>  extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *);
>>  extern int ext2_getattr (struct mnt_idmap *, const struct path *,
>> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
>> index 6b4bebe982ca..7a8561304559 100644
>> --- a/fs/ext2/file.c
>> +++ b/fs/ext2/file.c
>> @@ -161,12 +161,123 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>>  	return ret;
>>  }
>>
>> +static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
>> +{
>> +	struct file *file = iocb->ki_filp;
>> +	struct inode *inode = file->f_mapping->host;
>> +	ssize_t ret;
>> +
>> +	inode_lock_shared(inode);
>> +	ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
>> +	inode_unlock_shared(inode);
>> +
>> +	return ret;
>> +}
>> +
>> +static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>> +				 int error, unsigned int flags)
>> +{
>> +	loff_t pos = iocb->ki_pos;
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> +	if (error)
>> +		return error;
>> +
>
> I guess you should carry over here relevant bits of the comment from
> ext4_dio_write_end_io() explaining that doing i_size update here is
> necessary and actually safe.
>

Yes, sure. Will do that in the next rev.
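
Something like the following, perhaps; the comment wording here is only a
paraphrase of the rationale (based on how this patch serializes extending
writes), not the actual ext4 comment text:

static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
				 int error, unsigned int flags)
{
	loff_t pos = iocb->ki_pos;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (error)
		return error;

	pos += size;

	/*
	 * Update i_size only now that the data has landed, so a racing
	 * reader can never see an i_size that covers blocks whose data
	 * has not been written yet.  Doing it here is safe because
	 * extending direct writes are issued with the inode lock held and
	 * are forced to be synchronous (IOMAP_DIO_FORCE_WAIT), so no
	 * other i_size update can race with this one.
	 */
	if (pos > i_size_read(inode))
		i_size_write(inode, pos);

	return 0;
}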

>> +	pos += size;
>> +	if (pos > i_size_read(inode))
>> +		i_size_write(inode, pos);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct iomap_dio_ops ext2_dio_write_ops = {
>> +	.end_io = ext2_dio_write_end_io,
>> +};
>> +
>> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> +{
>> +	struct file *file = iocb->ki_filp;
>> +	struct inode *inode = file->f_mapping->host;
>> +	ssize_t ret;
>> +	unsigned int flags;
>> +	unsigned long blocksize = inode->i_sb->s_blocksize;
>> +	loff_t offset = iocb->ki_pos;
>> +	loff_t count = iov_iter_count(from);
>> +
>> +
>> +	inode_lock(inode);
>> +	ret = generic_write_checks(iocb, from);
>> +	if (ret <= 0)
>> +		goto out_unlock;
>> +	ret = file_remove_privs(file);
>> +	if (ret)
>> +		goto out_unlock;
>> +	ret = file_update_time(file);
>> +	if (ret)
>> +		goto out_unlock;
>> +
>> +	/*
>> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
>> +	 * calls for generic_write_sync in iomap_dio_complete().
>> +	 * Since ext2_fsync nmust be called w/o inode lock,
>> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
>> +	 * ourselves.
>> +	 */
>> +	flags = IOMAP_DIO_NOSYNC;
>
> Meh, this is kind of ugly and we should come up with something better for
> simple filesystems so that they don't have to play these games. Frankly,
> these days I doubt there's anybody really needing inode_lock in
> __generic_file_fsync(). Neither sync_mapping_buffers() nor
> sync_inode_metadata() need inode_lock for their self-consistency. So it is
> only about flushing more consistent set of metadata to disk when fsync(2)
> races with other write(2)s to the same file so after a crash we have higher
> chances of seeing some real state of the file. But I'm not sure it's really
> worth keeping for filesystems that are still using sync_mapping_buffers().
> People that care about consistency after a crash have IMHO moved to other
> filesystems long ago.
>

One way hch is suggesting is to use __iomap_dio_rw() -> unlock
inode -> call generic_write_sync(). I haven't yet worked on this part.
Are you suggesting to rip the inode_lock off of __generic_file_fsync()?
Won't it have much larger implications?


>> +
>> +	/* use IOMAP_DIO_FORCE_WAIT for unaligned of extending writes */
> 						  ^^ or
>

Yup. will fix it.

>> +	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
>> +	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
>> +		flags |= IOMAP_DIO_FORCE_WAIT;
>> +
>> +	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
>> +			   flags, NULL, 0);
>> +
>> +	if (ret == -ENOTBLK)
>> +		ret = 0;
>
> So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of
> blockdev_direct_IO(). Thus you have to implement that in your
> ext2_iomap_ops, in particular in iomap_begin...
>

Aah yes. Thanks for pointing that out -
ext2_iomap_begin() should have something like this -
	/*
	 * We cannot fill holes in indirect tree based inodes as that could
	 * expose stale data in the case of a crash. Use the magic error code
	 * to fallback to buffered I/O.
	 */
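
A rough sketch of where that check could sit inside ext2_iomap_begin();
the surrounding details (the ext2_get_blocks() call and its local
variables) are assumptions about how the eventual patch might look, not
existing code:

	bool create = flags & IOMAP_WRITE;

	/*
	 * We cannot fill holes in indirect tree based inodes as that could
	 * expose stale data in the case of a crash.  For direct writes
	 * inside i_size, map existing blocks only and return the magic
	 * -ENOTBLK so the caller falls back to buffered I/O, mirroring the
	 * old DIO_SKIP_HOLES behaviour of blockdev_direct_IO().
	 */
	if ((flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT) &&
	    offset < i_size_read(inode))
		create = false;

	ret = ext2_get_blocks(inode, first_block, max_blocks, &bno,
			      &new, &boundary, create);
	if (ret < 0)
		return ret;
	if (ret == 0 && (flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT))
		return -ENOTBLK;	/* hole inside i_size: fall back */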

Also I think ext2_iomap_end() should also handle a case like in ext4 -

	/*
	 * Check to see whether an error occurred while writing out the data to
	 * the allocated blocks. If so, return the magic error code so that we
	 * fallback to buffered I/O and attempt to complete the remainder of
	 * the I/O. Any blocks that may have been allocated in preparation for
	 * the direct I/O will be reused during buffered I/O.
	 */
	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
		return -ENOTBLK;


I am wondering whether we have testcases in xfstests which really test
these functionalities or not. Let me give it a try...
... So I did, and somehow couldn't find any testcase which fails w/o the
above changes.

I also ran LTP dio tests (./runltp -f dio -d /mnt1/scratch/tmp/) with the
current patches on ext2 and all the tests passed. So it seems LTP
doesn't have tests to trigger stale data exposure problems.

Do you know of any test case which could trigger this? Otherwise I can
finish updating the patches and then work on writing some test cases to
trigger this.

Another query -

We have this function ext2_iomap_end() (pasted below)
which calls ext2_write_failed().

Here IMO two cases are possible -

1. written is 0, which means an error has occurred.
In that case calling ext2_write_failed() makes sense.

2. Consider a case where written > 0 && written < length.
(This is possible, right?) In that case we still go and call
ext2_write_failed(). This function will truncate the pagecache and disk
blocks beyond i_size. Now, we haven't yet updated inode->i_size (we do
that in ->end_io, which gets called at the end during completion),
so that means it just removes everything.

Then in ext2_dax_write_iter(), we might go and update inode->i_size
to iocb->ki_pos, including for short writes. This doesn't look
consistent, because earlier we had destroyed all the blocks for the short
write and we would still be returning ret > 0 to the user, saying that
many bytes have been written.
Again, I haven't yet found a test case, at least not in xfstests, which
can trigger these short writes. Let me know your thoughts on this.
All of this rests on the fact that there can be a case where
written > 0 && written < length. I will read more to see if this even
happens or not, but I at least wanted to capture this somewhere.

static int
ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
		ssize_t written, unsigned flags, struct iomap *iomap)
{
	if (iomap->type == IOMAP_MAPPED &&
	    written < length &&
	    (flags & IOMAP_WRITE))
		ext2_write_failed(inode->i_mapping, offset + length);
	return 0;
}

void ext2_write_failed(struct address_space *mapping, loff_t to)
{
	struct inode *inode = mapping->host;

	if (to > inode->i_size) {
		truncate_pagecache(inode, inode->i_size);
		ext2_truncate_blocks(inode, inode->i_size);
	}
}

ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
<...>
	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
		i_size_write(inode, iocb->ki_pos);
		mark_inode_dirty(inode);
	}
<...>


Another thing -
In DAX, while truncating the inode's i_size in ext2_setsize(),
I think we don't call dax_zero_range() properly when we are trying to
zero the last block beyond EOF. E.g. it can be called with a len
of 0 if newsize is page-aligned. It will then call ext2_get_blocks() with
len = 0 and can hit a BUG_ON() for maxblocks == 0.

I think it should be the following. I will spend some more time analyzing
this and also test it against the DAX paths.

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7ff669d0b6d2..cc264b1e288c 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
        inode_dio_wait(inode);

        if (IS_DAX(inode))
-               error = dax_zero_range(inode, newsize,
-                                      PAGE_ALIGN(newsize) - newsize, NULL,
-                                      &ext2_iomap_ops);
+               error = dax_truncate_page(inode, newsize, NULL,
+                                         &ext2_iomap_ops);
        else
                error = block_truncate_page(inode->i_mapping,
                                newsize, ext2_get_block);

-ritesh


> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-03-22  6:34           ` Ritesh Harjani
@ 2023-03-23 11:30             ` Jan Kara
  2023-03-23 13:19               ` Ritesh Harjani
  2023-03-30  0:02               ` Christoph Hellwig
  0 siblings, 2 replies; 23+ messages in thread
From: Jan Kara @ 2023-03-23 11:30 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc

On Wed 22-03-23 12:04:01, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
> >> +	pos += size;
> >> +	if (pos > i_size_read(inode))
> >> +		i_size_write(inode, pos);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static const struct iomap_dio_ops ext2_dio_write_ops = {
> >> +	.end_io = ext2_dio_write_end_io,
> >> +};
> >> +
> >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >> +{
> >> +	struct file *file = iocb->ki_filp;
> >> +	struct inode *inode = file->f_mapping->host;
> >> +	ssize_t ret;
> >> +	unsigned int flags;
> >> +	unsigned long blocksize = inode->i_sb->s_blocksize;
> >> +	loff_t offset = iocb->ki_pos;
> >> +	loff_t count = iov_iter_count(from);
> >> +
> >> +
> >> +	inode_lock(inode);
> >> +	ret = generic_write_checks(iocb, from);
> >> +	if (ret <= 0)
> >> +		goto out_unlock;
> >> +	ret = file_remove_privs(file);
> >> +	if (ret)
> >> +		goto out_unlock;
> >> +	ret = file_update_time(file);
> >> +	if (ret)
> >> +		goto out_unlock;
> >> +
> >> +	/*
> >> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
> >> +	 * calls for generic_write_sync in iomap_dio_complete().
> >> +	 * Since ext2_fsync nmust be called w/o inode lock,
> >> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
> >> +	 * ourselves.
> >> +	 */
> >> +	flags = IOMAP_DIO_NOSYNC;
> >
> > Meh, this is kind of ugly and we should come up with something better for
> > simple filesystems so that they don't have to play these games. Frankly,
> > these days I doubt there's anybody really needing inode_lock in
> > __generic_file_fsync(). Neither sync_mapping_buffers() nor
> > sync_inode_metadata() need inode_lock for their self-consistency. So it is
> > only about flushing more consistent set of metadata to disk when fsync(2)
> > races with other write(2)s to the same file so after a crash we have higher
> > chances of seeing some real state of the file. But I'm not sure it's really
> > worth keeping for filesystems that are still using sync_mapping_buffers().
> > People that care about consistency after a crash have IMHO moved to other
> > filesystems long ago.
> >
> 
> One way which hch is suggesting is to use __iomap_dio_rw() -> unlock
> inode -> call generic_write_sync(). I haven't yet worked on this part.

So I see two problems with what Christoph suggests:

a) It is unfortunate API design to require a trivial (and low maintenance)
   filesystem to do these relatively complex locking games. But this can
   be solved by providing an appropriate wrapper for them, I guess.

b) When you unlock the inode, other stuff can happen with the inode. And
   e.g. i_size update needs to happen after IO is completed so filesystems
   would have to be taught to avoid say two racing expanding writes. That's
   IMHO really too much to ask.

> Are you suggesting to rip of inode_lock from __generic_file_fsync()?
> Won't it have a much larger implications?

Yes and yes :). But inode writeback already happens from other paths
without inode_lock so there's hardly any surprise there.
sync_mapping_buffers() is impossible to "customize" by filesystems and the
generic code is fine without inode_lock. So I have hard time imagining how
any filesystem would really depend on inode_lock in this path (famous last
words ;)).

> >> +	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
> >> +	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
> >> +		flags |= IOMAP_DIO_FORCE_WAIT;
> >> +
> >> +	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
> >> +			   flags, NULL, 0);
> >> +
> >> +	if (ret == -ENOTBLK)
> >> +		ret = 0;
> >
> > So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of
> > blockdev_direct_IO(). Thus you have to implement that in your
> > ext2_iomap_ops, in particular in iomap_begin...
> >
> 
> Aah yes. Thanks for pointing that out -
> ext2_iomap_begin() should have something like this -
> 	/*
> 	 * We cannot fill holes in indirect tree based inodes as that could
> 	 * expose stale data in the case of a crash. Use the magic error code
> 	 * to fallback to buffered I/O.
> 	 */
> 
> Also I think ext2_iomap_end() should also handle a case like in ext4 -
> 
> 	/*
> 	 * Check to see whether an error occurred while writing out the data to
> 	 * the allocated blocks. If so, return the magic error code so that we
> 	 * fallback to buffered I/O and attempt to complete the remainder of
> 	 * the I/O. Any blocks that may have been allocated in preparation for
> 	 * the direct I/O will be reused during buffered I/O.
> 	 */
> 	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> 		return -ENOTBLK;
> 
> 
> I am wondering if we have testcases in xfstests which really tests these
> functionalities also or not? Let me give it a try...
> ... So I did and somehow couldn't find any testcase which fails w/o
> above changes.

I guess we don't. It isn't that simple (but certainly possible) to test for
stale data exposure...

> Another query -
> 
> We have this function ext2_iomap_end() (pasted below)
> which calls ext2_write_failed().
> 
> Here IMO two cases are possible -
> 
> 1. written is 0. which means an error has occurred.
> In that case calling ext2_write_failed() make sense.
> 
> 2. consider a case where written > 0 && written < length.
> (This is possible right?). In that case we still go and call
> ext2_write_failed(). This function will truncate the pagecache and disk
> blocks beyong i_size. Now we haven't yet updated inode->i_size (we do
> that in ->end_io which gets called in the end during completion)
> So that means it just removes everything.
> 
> Then in ext2_dax_write_iter(), we might go and update inode->i_size
> to iocb->ki_pos including for short writes. This looks like it isn't
> consistent because earlier we had destroyed all the blocks for the short
> writes and we will be returning ret > 0 to the user saying these many
> bytes have been written.
> Again I haven't yet found a test case at least not in xfstests which
> can trigger this short writes. Let me know your thoughts on this.
> All of this lies on the fact that there can be a case where
> written > 0 && written < length. I will read more to see if this even
> happens or not. But I atleast wanted to capture this somewhere.

So as far as I remember, direct IO writes as implemented in iomap are
all-or-nothing (see iomap_dio_complete()). But it would be good to assert
that in ext4 code to avoid surprises if the generic code changes.
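
If we want to be defensive about that assumption, a small check after the
iomap_dio_rw() call in the patch's ext2_dio_write_iter() would be enough
(a sketch; 'count' is the byte count the patch already samples before the
call):

	/* iomap direct writes are expected to be all-or-nothing */
	WARN_ON_ONCE(ret > 0 && ret < count);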

> Another thing -
> In dax while truncating the inode i_size in ext2_setsize(),
> I think we don't properly call dax_zero_blocks() when we are trying to
> zero the last block beyond EOF. i.e. for e.g. it can be called with len
> as 0 if newsize is page_aligned. It then will call ext2_get_blocks() with
> len = 0 and can bug_on at maxblocks == 0.

How will it call ext2_get_blocks() with len == 0? AFAICS iomap_iter() will
not call iomap_begin() at all if iter.len == 0.

> I think it should be this. I will spend some more time analyzing this
> and also test it once against DAX paths.
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 7ff669d0b6d2..cc264b1e288c 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
>         inode_dio_wait(inode);
> 
>         if (IS_DAX(inode))
> -               error = dax_zero_range(inode, newsize,
> -                                      PAGE_ALIGN(newsize) - newsize, NULL,
> -                                      &ext2_iomap_ops);
> +               error = dax_truncate_page(inode, newsize, NULL,
> +                                         &ext2_iomap_ops);
>         else
>                 error = block_truncate_page(inode->i_mapping,
>                                 newsize, ext2_get_block);

That being said this is indeed a nice cleanup.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-03-23 11:30             ` Jan Kara
@ 2023-03-23 13:19               ` Ritesh Harjani
  2023-03-30  0:02               ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Ritesh Harjani @ 2023-03-23 13:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc

Jan Kara <jack@suse.cz> writes:

Hi Jan,

> On Wed 22-03-23 12:04:01, Ritesh Harjani wrote:
>> Jan Kara <jack@suse.cz> writes:
>> >> +	pos += size;
>> >> +	if (pos > i_size_read(inode))
>> >> +		i_size_write(inode, pos);
>> >> +
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +static const struct iomap_dio_ops ext2_dio_write_ops = {
>> >> +	.end_io = ext2_dio_write_end_io,
>> >> +};
>> >> +
>> >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> >> +{
>> >> +	struct file *file = iocb->ki_filp;
>> >> +	struct inode *inode = file->f_mapping->host;
>> >> +	ssize_t ret;
>> >> +	unsigned int flags;
>> >> +	unsigned long blocksize = inode->i_sb->s_blocksize;
>> >> +	loff_t offset = iocb->ki_pos;
>> >> +	loff_t count = iov_iter_count(from);
>> >> +
>> >> +
>> >> +	inode_lock(inode);
>> >> +	ret = generic_write_checks(iocb, from);
>> >> +	if (ret <= 0)
>> >> +		goto out_unlock;
>> >> +	ret = file_remove_privs(file);
>> >> +	if (ret)
>> >> +		goto out_unlock;
>> >> +	ret = file_update_time(file);
>> >> +	if (ret)
>> >> +		goto out_unlock;
>> >> +
>> >> +	/*
>> >> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
>> >> +	 * calls for generic_write_sync in iomap_dio_complete().
>> >> +	 * Since ext2_fsync nmust be called w/o inode lock,
>> >> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
>> >> +	 * ourselves.
>> >> +	 */
>> >> +	flags = IOMAP_DIO_NOSYNC;
>> >
>> > Meh, this is kind of ugly and we should come up with something better for
>> > simple filesystems so that they don't have to play these games. Frankly,
>> > these days I doubt there's anybody really needing inode_lock in
>> > __generic_file_fsync(). Neither sync_mapping_buffers() nor
>> > sync_inode_metadata() need inode_lock for their self-consistency. So it is
>> > only about flushing more consistent set of metadata to disk when fsync(2)
>> > races with other write(2)s to the same file so after a crash we have higher
>> > chances of seeing some real state of the file. But I'm not sure it's really
>> > worth keeping for filesystems that are still using sync_mapping_buffers().
>> > People that care about consistency after a crash have IMHO moved to other
>> > filesystems long ago.
>> >
>>
>> One way which hch is suggesting is to use __iomap_dio_rw() -> unlock
>> inode -> call generic_write_sync(). I haven't yet worked on this part.
>
> So I see two problems with what Christoph suggests:
>
> a) It is unfortunate API design to require trivial (and low maintenance)
>    filesystem to do these relatively complex locking games. But this can
>    be solved by providing appropriate wrapper for them I guess.
>
> b) When you unlock the inode, other stuff can happen with the inode. And
>    e.g. i_size update needs to happen after IO is completed so filesystems
>    would have to be taught to avoid say two racing expanding writes. That's
>    IMHO really too much to ask.
>

Yes, that's the reason I was not touching it; I thought I would
get back to it once I figure out the other things.


>> Are you suggesting to rip of inode_lock from __generic_file_fsync()?
>> Won't it have a much larger implications?
>
> Yes and yes :). But inode writeback already happens from other paths
> without inode_lock so there's hardly any surprise there.
> sync_mapping_buffers() is impossible to "customize" by filesystems and the
> generic code is fine without inode_lock. So I have hard time imagining how
> any filesystem would really depend on inode_lock in this path (famous last
> words ;)).
>

Ok sure. I will spend some time looking into this code and its history. And
if everything looks good, I will rip the inode_lock() out of __generic_file_fsync().


>> >> +	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
>> >> +	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
>> >> +		flags |= IOMAP_DIO_FORCE_WAIT;
>> >> +
>> >> +	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
>> >> +			   flags, NULL, 0);
>> >> +
>> >> +	if (ret == -ENOTBLK)
>> >> +		ret = 0;
>> >
>> > So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of
>> > blockdev_direct_IO(). Thus you have to implement that in your
>> > ext2_iomap_ops, in particular in iomap_begin...
>> >
>>
>> Aah yes. Thanks for pointing that out -
>> ext2_iomap_begin() should have something like this -
>> 	/*
>> 	 * We cannot fill holes in indirect tree based inodes as that could
>> 	 * expose stale data in the case of a crash. Use the magic error code
>> 	 * to fallback to buffered I/O.
>> 	 */
>>
>> Also I think ext2_iomap_end() should also handle a case like in ext4 -
>>
>> 	/*
>> 	 * Check to see whether an error occurred while writing out the data to
>> 	 * the allocated blocks. If so, return the magic error code so that we
>> 	 * fallback to buffered I/O and attempt to complete the remainder of
>> 	 * the I/O. Any blocks that may have been allocated in preparation for
>> 	 * the direct I/O will be reused during buffered I/O.
>> 	 */
>> 	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
>> 		return -ENOTBLK;
>>
>>
>> I am wondering if we have testcases in xfstests which really tests these
>> functionalities also or not? Let me give it a try...
>> ... So I did and somehow couldn't find any testcase which fails w/o
>> above changes.
>
> I guess we don't. It isn't that simple (but certainly possible) to test for
> stale data exposure...
>

Yes, I am currently working on writing an aio-dio test case to simulate
this. AFAIU, the stale data exposure problem with non-extent filesystems
exists because they don't support unwritten extents. In that case, if
we submit an aio-dio write over a hole (inside the file's i_size), a
buffered read can race with it. That is, if the block allocation
happens via aio-dio, then the buffered read can read stale data from
that block, which is within the inode's i_size.

This is not a problem for extending file writes, because we only update
the file's i_size after the write is done (aio-dio is also made synchronous
against extending file writes, but that is to avoid races against other
aio-dio).

And this is not a problem for filesystems which support unwritten extents,
because we can always allocate an unwritten extent inside a hole, and then a
racing buffered read cannot read stale data (reads of unwritten extents
return zeroes). The unwritten-to-written conversion can then happen at the
end, once the data has already been written to those blocks.


>> Another query -
>>
>> We have this function ext2_iomap_end() (pasted below)
>> which calls ext2_write_failed().
>>
>> Here IMO two cases are possible -
>>
>> 1. written is 0. which means an error has occurred.
>> In that case calling ext2_write_failed() make sense.
>>
>> 2. consider a case where written > 0 && written < length.
>> (This is possible right?). In that case we still go and call
>> ext2_write_failed(). This function will truncate the pagecache and disk
>> blocks beyond i_size. Now we haven't yet updated inode->i_size (we do
>> that in ->end_io which gets called in the end during completion)
>> So that means it just removes everything.
>>
>> Then in ext2_dax_write_iter(), we might go and update inode->i_size
>> to iocb->ki_pos including for short writes. This looks like it isn't
>> consistent because earlier we had destroyed all the blocks for the short
>> writes and we will be returning ret > 0 to the user saying these many
>> bytes have been written.
>> Again I haven't yet found a test case at least not in xfstests which
>> can trigger this short writes. Let me know your thoughts on this.
>> All of this lies on the fact that there can be a case where
>> written > 0 && written < length. I will read more to see if this even
>> happens or not. But I at least wanted to capture this somewhere.
>
> So as far as I remember, direct IO writes as implemented in iomap are
> all-or-nothing (see iomap_dio_complete()). But it would be good to assert
> that in ext4 code to avoid surprises if the generic code changes.
>

Ok, I was talking about error paths. But I think I will park this one and
go through it later.

>> Another thing -
>> In dax while truncating the inode i_size in ext2_setsize(),
>> I think we don't properly call dax_zero_blocks() when we are trying to
>> zero the last block beyond EOF. i.e. for e.g. it can be called with len
>> as 0 if newsize is page_aligned. It then will call ext2_get_blocks() with
>> len = 0 and can bug_on at maxblocks == 0.
>
> How will it call ext2_get_blocks() with len == 0? AFAICS iomap_iter() will
> not call iomap_begin() at all if iter.len == 0.
>

It can. If dax_zero_range() is passed len = 0, then iomap_iter() gets
called with a zero len and it will in turn call ->iomap_begin() with a
zero len.
A WARN_ON() in iomap_iter() would certainly help to catch a zero
iter->len being passed to ->iomap_begin().

(Note iter->len and iter->iomap.length are different)
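
Something along these lines (untested) is what I mean:

	/*
	 * Hypothetical check near the top of iomap_iter(): complain about
	 * callers asking for a zero-length mapping instead of silently
	 * passing len == 0 down to ->iomap_begin().
	 */
	if (WARN_ON_ONCE(iter->len == 0))
		return -EINVAL;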

>> I think it should be this. I will spend some more time analyzing this
>> and also test it once against DAX paths.
>>
>> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
>> index 7ff669d0b6d2..cc264b1e288c 100644
>> --- a/fs/ext2/inode.c
>> +++ b/fs/ext2/inode.c
>> @@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
>>         inode_dio_wait(inode);
>>
>>         if (IS_DAX(inode))
>> -               error = dax_zero_range(inode, newsize,
>> -                                      PAGE_ALIGN(newsize) - newsize, NULL,
>> -                                      &ext2_iomap_ops);
>> +               error = dax_truncate_page(inode, newsize, NULL,
>> +                                         &ext2_iomap_ops);
>>         else
>>                 error = block_truncate_page(inode->i_mapping,
>>                                 newsize, ext2_get_block);
>
> That being said this is indeed a nice cleanup.

Sure. I will add this patch at the beginning of the next revision.

-ritesh


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
  2023-03-23 11:30             ` Jan Kara
  2023-03-23 13:19               ` Ritesh Harjani
@ 2023-03-30  0:02               ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2023-03-30  0:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ritesh Harjani, linux-fsdevel, linux-mm, lsf-pc

On Thu, Mar 23, 2023 at 12:30:30PM +0100, Jan Kara wrote:
> > 
> > One way which hch is suggesting is to use __iomap_dio_rw() -> unlock
> > inode -> call generic_write_sync(). I haven't yet worked on this part.
> 
> So I see two problems with what Christoph suggests:
> 
> a) It is unfortunate API design to require trivial (and low maintenance)
>    filesystem to do these relatively complex locking games. But this can
>    be solved by providing appropriate wrapper for them I guess.

Well, the problem is that these "trivial" file systems have a pretty
broken locking scheme for fsync.

The legacy dio code gets around this by unlocking i_rwsem inside of
__blockdev_direct_IO.  That forces a particular and somewhat odd
locking scheme on the callers, and severely limits what it can do.

> > Are you suggesting to rip of inode_lock from __generic_file_fsync()?
> > Won't it have a much larger implications?
> 
> Yes and yes :). But inode writeback already happens from other paths
> without inode_lock so there's hardly any surprise there.
> sync_mapping_buffers() is impossible to "customize" by filesystems and the
> generic code is fine without inode_lock. So I have hard time imagining how
> any filesystem would really depend on inode_lock in this path (famous last
> words ;)).

Not holding the inode lock in ->fsync would solve all of this indeed.
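
The __iomap_dio_rw() -> unlock -> generic_write_sync() ordering mentioned
above would roughly look like this in the write path. This is an untested
sketch; the ext2_* names and the flags variable are assumed from the WIP
patch:

	struct iomap_dio *dio;
	ssize_t ret;

	inode_lock(inode);
	/* ... generic_write_checks(), alignment checks etc. go here ... */

	dio = __iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
			     flags, NULL, 0);
	if (IS_ERR_OR_NULL(dio))
		ret = PTR_ERR_OR_ZERO(dio);
	else
		ret = iomap_dio_complete(dio);
	/* -ENOTBLK means: fall back to buffered I/O */
	if (ret == -ENOTBLK)
		ret = 0;

	/* drop i_rwsem before fsync so that ->fsync can take it itself */
	inode_unlock(inode);

	if (ret > 0)
		ret = generic_write_sync(iocb, ret);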


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-03-30  0:02 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-29  4:46 LSF/MM/BPF 2023 IOMAP conversion status update Luis Chamberlain
2023-01-29  5:06 ` Matthew Wilcox
2023-01-29  5:39   ` Luis Chamberlain
2023-02-08 16:04   ` Jan Kara
2023-02-24  7:01     ` Zhang Yi
2023-02-26 20:16     ` Ritesh Harjani
2023-03-16 14:40       ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM)
2023-03-16 15:41         ` Darrick J. Wong
2023-03-20 16:11           ` Ritesh Harjani
2023-03-20 13:15         ` Christoph Hellwig
2023-03-20 17:51         ` Jan Kara
2023-03-22  6:34           ` Ritesh Harjani
2023-03-23 11:30             ` Jan Kara
2023-03-23 13:19               ` Ritesh Harjani
2023-03-30  0:02               ` Christoph Hellwig
2023-02-27 19:26     ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong
2023-02-27 21:02       ` Matthew Wilcox
2023-02-27 19:47   ` Darrick J. Wong
2023-02-27 20:24     ` Luis Chamberlain
2023-02-27 19:06 ` Darrick J. Wong
2023-02-27 19:58   ` Luis Chamberlain
2023-03-01 16:59 ` Ritesh Harjani
2023-03-01 17:08   ` Darrick J. Wong
