From: Junxiao Bi <junxiao.bi@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: cluster-devel <cluster-devel@redhat.com>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Andreas Gruenbacher <agruenba@redhat.com>,
ocfs2-devel@oss.oracle.com
Subject: Re: [Ocfs2-devel] [Cluster-devel] [PATCH 1/3] fs/buffer.c: add new api to allow eof writeback
Date: Mon, 3 May 2021 10:25:31 -0700 [thread overview]
Message-ID: <72cde802-bd8a-9ce5-84d7-57b34a6a8b03@oracle.com> (raw)
In-Reply-To: <20210503102904.GC2994@quack2.suse.cz>
On 5/3/21 3:29 AM, Jan Kara wrote:
> On Fri 30-04-21 14:18:15, Junxiao Bi wrote:
>> On 4/30/21 5:47 AM, Jan Kara wrote:
>>
>>> On Thu 29-04-21 11:07:15, Junxiao Bi wrote:
>>>> On 4/29/21 10:14 AM, Andreas Gruenbacher wrote:
>>>>> On Tue, Apr 27, 2021 at 4:44 AM Junxiao Bi <junxiao.bi@oracle.com> wrote:
>>>>>> When doing truncate/fallocate, some filesystems such as ocfs2
>>>>>> zero pages that are beyond the inode size and only later update
>>>>>> the inode size, so they need this API to write back pages past
>>>>>> EOF.
>>>>> Is this in reaction to Jan's "[PATCH 0/12 v4] fs: Hole punch vs page
>>>>> cache filling races" patch set [*]? It doesn't look like the kind of
>>>>> patch Christoph would be happy with.
>>>> Thank you for pointing out that patch set. I think it is fixing a
>>>> different issue.
>>>>
>>>> The issue here is that when extending the file size with
>>>> fallocate/truncate, if the original inode size is in the middle of the
>>>> last cluster block (1MB), the EOF part is first zeroed with a buffered
>>>> write, and only then is the new inode size updated. So there is a window
>>>> in which dirty pages lie beyond the inode size; if writeback kicks in
>>>> during that window, block_write_full_page will drop all those EOF pages.
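
For reference, the behaviour described above comes from the i_size check at
the top of block_write_full_page(); roughly (paraphrased from fs/buffer.c of
that era, so details may vary by kernel version) it looks like this:

	struct inode * const inode = page->mapping->host;
	loff_t i_size = i_size_read(inode);
	const pgoff_t end_index = i_size >> PAGE_SHIFT;
	unsigned int offset;

	/* Page fully inside i_size: write it out normally. */
	if (page->index < end_index)
		return __block_write_full_page(inode, page, get_block, wbc,
					       end_buffer_async_write);

	/*
	 * Page fully outside i_size (truncate assumed to be in progress):
	 * just unlock and skip it -- this is what drops the zeroed EOF
	 * pages during the window described above.
	 */
	offset = i_size & (PAGE_SIZE - 1);
	if (page->index >= end_index + 1 || !offset) {
		unlock_page(page);
		return 0;
	}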
>>> I agree that the buffers describing the part of the cluster beyond i_size
>>> won't be written. But the page cache will remain zeroed out, so that is
>>> fine. So you only need to zero out the on-disk contents. Since this is
>>> actually a physically contiguous range of blocks, why don't you just use
>>> sb_issue_zeroout() to zero out the tail of the cluster? It will be more
>>> efficient than going through the page cache, and you also won't have to
>>> tweak block_write_full_page()...
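
A minimal sketch of what Jan is suggesting, assuming the caller has already
computed the tail of the cluster (tail_block and tail_nr_blocks below are
hypothetical placeholders, in filesystem block units):

	/*
	 * Zero the tail of the cluster directly on disk instead of dirtying
	 * page cache pages beyond i_size.  sb_issue_zeroout() waits for the
	 * I/O to complete before returning.
	 */
	int err = sb_issue_zeroout(inode->i_sb, tail_block, tail_nr_blocks,
				   GFP_NOFS);
	if (err)
		return err;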
>> Thanks for the review.
>>
>> The physical blocks to be zeroed are contiguous only when sparse mode is
>> enabled. If sparse mode is disabled, unwritten extents are not supported by
>> ocfs2, so all the blocks up to the new size will be zeroed by the buffered
>> write. Since sb_issue_zeroout() has to wait for the I/O to complete, there
>> would be a lot of delay when extending the file size. Using writeback to
>> flush asynchronously seemed more efficient?
> It depends. Higher-end storage (e.g. NVMe or NAS, maybe some better SATA
> flash disks as well) supports the WRITE_ZERO command, so you don't actually
> have to write all those zeros. The storage just internally marks all those
> blocks as containing zeros. This is rather fast, so I'd expect the overall
> result to be faster than zeroing the page cache and then writing all those
> pages of zeroes on transaction commit. But I agree that for lower-end
> storage this may be slower because of the synchronous writing of zeroes.
> That being said, your transaction commit has to write those zeroes anyway,
> so the cost is mostly just shifted, but it could still make a difference
> for some workloads. Not sure if that matters; that is your call, I'd say.
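
Whether a given device actually offloads zeroing can be probed at runtime; a
minimal sketch using the blkdev helper (a non-zero return means the device
accepts REQ_OP_WRITE_ZEROES):

	/*
	 * If the device advertises write-zeroes support, sb_issue_zeroout()
	 * can mark the blocks as zero without transferring zero-filled
	 * payloads; otherwise it falls back to writing pages of zeroes.
	 */
	if (bdev_write_zeroes_sectors(sb->s_bdev))
		pr_debug("device offloads zeroing\n");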
Ocfs2 is mostly used with SAN, and I don't think it's common for SAN storage
to support the WRITE_ZERO command.

Is there anything wrong with adding a new API to support EOF writeback?
Thanks,
Junxiao.
>
> Also note that you could submit those zeroing bios asynchronously, but that
> would be more coding, and you would need to make sure they have completed by
> transaction commit, so it probably isn't worth the complexity.
>
> Honza
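
A rough sketch of the asynchronous variant Jan mentions, assuming a
caller-provided completion handler (zeroout_end_io below is hypothetical);
__blkdev_issue_zeroout() returns the last bio of the chain unsubmitted, so the
caller can attach its own end_io and wait for it before the transaction
commits:

	struct bio *bio = NULL;
	int err;

	/* Build a chain of zeroing bios for the range without waiting. */
	err = __blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS,
				     &bio, 0);
	if (!err && bio) {
		/* Hypothetical completion hook; it should signal a completion
		 * and bio_put() the bio. */
		bio->bi_end_io = zeroout_end_io;
		submit_bio(bio);
		/* ... wait for zeroout_end_io before committing the transaction */
	}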