* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
@ 2023-02-21 23:36 ` Andrew Morton
2023-02-24 1:33 ` pr-tracker-bot
` (7 subsequent siblings)
8 siblings, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2023-02-21 23:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, mm-commits, linux-kernel, Andreas Gruenbacher
On Mon, 20 Feb 2023 13:52:25 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> There are a lot of conflicts in your future, mainly because of the
> ongoing folio conversion work. This will hopefully come to an end
> fairly soon. Forthcoming conflicts which are known about, along with
> Stephen's fixes are:
> ...
And... I failed to mention a conflict which didn't generate a reject,
thanks to a post-6.2-rc4 gfs2 patch.
Stephen's fix for this is at
https://lkml.kernel.org/r/20230127173638.1efbe423@canb.auug.org.au
Or,
From: Andrew Morton <akpm@linux-foundation.org>
Subject: fs/gfs2/log.c: fix build in __gfs2_writepage()
Date: Tue Feb 21 03:23:08 PM PST 2023
mm-stable was based on 6.2-rc4 and hence its patch d585bdbeb79a ("fs:
convert writepage_t callback to pass a folio") didn't know about the
post-rc4 95ecbd0f162f ("Revert "gfs2: stop using generic_writepages in
gfs2_ail1_start_one"").
Net result is that d585bdbeb79a failed to convert fs/gfs2/log.c. The fix
is from Andreas.
Fixes: d585bdbeb79a ("fs: convert writepage_t callback to pass a folio")
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lkml.kernel.org/r/20230203105401.3362277-1-agruenba@redhat.com
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
--- a/fs/gfs2/log.c~fs-gfs2-logc-fix-build-in-__gfs2_writepage
+++ a/fs/gfs2/log.c
@@ -80,11 +80,11 @@ void gfs2_remove_from_ail(struct gfs2_bu
brelse(bd->bd_bh);
}
-static int __gfs2_writepage(struct page *page, struct writeback_control *wbc,
- void *data)
+static int __gfs2_writepage(struct folio *folio, struct writeback_control *wbc,
+ void *data)
{
struct address_space *mapping = data;
- int ret = mapping->a_ops->writepage(page, wbc);
+ int ret = mapping->a_ops->writepage(&folio->page, wbc);
mapping_set_error(mapping, ret);
return ret;
}
_
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
2023-02-21 23:36 ` Andrew Morton
@ 2023-02-24 1:33 ` pr-tracker-bot
2023-02-24 1:33 ` Linus Torvalds
` (6 subsequent siblings)
8 siblings, 0 replies; 33+ messages in thread
From: pr-tracker-bot @ 2023-02-24 1:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, linux-mm, mm-commits, linux-kernel
The pull request you sent on Mon, 20 Feb 2023 13:52:25 -0800:
> https://lkml.kernel.org/r/20230106125915.60c8b547@canb.auug.org.au drivers/infiniband/hw/hfi1/file_ops.c
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/3822a7c40997dc86b1458766a3f146d62393f084
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
2023-02-21 23:36 ` Andrew Morton
2023-02-24 1:33 ` pr-tracker-bot
@ 2023-02-24 1:33 ` Linus Torvalds
2023-02-24 1:56 ` Andrew Morton
2023-02-24 3:01 ` Huang, Ying
2023-02-24 9:04 ` David Howells
` (5 subsequent siblings)
8 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2023-02-24 1:33 UTC (permalink / raw)
To: Andrew Morton, Jan Kara, Vishal Moola, Paulo Alcantara,
Matthew Wilcox, David Howells, Steve French, Huang Ying,
Baolin Wang, Xin Hao
Cc: linux-mm, mm-commits, linux-kernel
On Mon, Feb 20, 2023 at 1:52 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Linus, please merge this cycle's MM updates.
Ok, finally got around to this one, and I am not thrilled about it.
A few notes:
- my fs/udf/inode.c is "minimal". Ugly, ugly, I think
udf_adinicb_writepage() should just be made to actually deal with
folios inherently, but I did *NOT* do that, and just did
struct page *page = &folio->page;
at the top, and left it at that. I'm not proud of it, but I hope
Jan might want to do this properly.
That conflict wasn't mentioned, and now I wonder if the UDF changes
were in -next at all?
- the fs/cifs/file.c conflict was a nightmare.
Yes, I saw David Howells resolution suggestion. I think that one
was buggy. It would wait for a page under writeback, and then go on to
the *next* one without writing it back. I don't thin kthat was right.
That said, I'm not at all convinced my version is right either. I
can't test it, and that means I probably messed up. It looked sane to
me when I did it, and it builds cleanly, but I honestly doubt myself.
I think that code should probably use xas_for_each_marked(), but
instead I kept the old model, added a loop like DavidH did, and then
munged the end result until I thought it was palatable.
NOTE! Don't even try to look at the conflict diff. It makes no
sense at all. But somebody (I'd hope all of DavidH, SteveF and Willy)
should really take a look at my end result.
- gcc 12.2.1 quite reasonable complains about some of the new MM code:
mm/migrate.c: In function ‘__migrate_folio_extract’:
mm/migrate.c:1050:20: note: randstruct: casting between randomized
structure pointer types (ssa): ‘struct anon_vma’ and ‘struct
address_space’
1050 | *anon_vmap = (void *)dst->mapping;
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
and while this doesn't cause a build failure ("note" is different
from "warning"), I do think something needs to be done. Gcc is right.
This code seems to *work* simply because it's intentionally
mis-casting pointers, but I think it needs to be seriously looked at
and something done to make gcc happy (and a *LARGE* comment about it).
That last note is not some merge result, it's purely about the new MM code.
Anyway, the merge is done and pushed out, I just am not very confident
about the end result at all. That cifs thing really needs somebody
competent looking at it.
I think I went through three different iterations of my resolution
before I was happy with my approach, and "happy" may end up being more
about having exhausted my will to live, than about the end result
actually being any good.
I saw some noise about ext4 being a nightmare too, but I haven't
gotten that pull request yet.
I'll tackle the non-MM pull next, but I'm taking a break first.
Alcohol may have to be involved.
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-24 1:33 ` Linus Torvalds
@ 2023-02-24 1:56 ` Andrew Morton
2023-02-24 3:01 ` Huang, Ying
1 sibling, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2023-02-24 1:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jan Kara, Vishal Moola, Paulo Alcantara, Matthew Wilcox,
David Howells, Steve French, Huang Ying, Baolin Wang, Xin Hao,
linux-mm, mm-commits, linux-kernel
On Thu, 23 Feb 2023 17:33:37 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, Feb 20, 2023 at 1:52 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Linus, please merge this cycle's MM updates.
>
> Ok, finally got around to this one, and I am not thrilled about it.
>
> A few notes:
>
> - my fs/udf/inode.c is "minimal". Ugly, ugly, I think
> udf_adinicb_writepage() should just be made to actually deal with
> folios inherently, but I did *NOT* do that, and just did
>
> struct page *page = &folio->page;
>
> at the top, and left it at that. I'm not proud of it, but I hope
> Jan might want to do this properly.
>
> That conflict wasn't mentioned, and now I wonder if the UDF changes
> were in -next at all?
This was a mea culpa, sorry. Stephen did encounter and resolve this
(https://lkml.kernel.org/r/20230127165912.0e4a7b66@canb.auug.org.au)
but I was fixated on his "linux-next: manual merge of the mm-stable
tree with the XXX tree" emails and not his "linux-next: build failure after
merge of the mm tree" emails. Self-LART has been applied.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-24 1:33 ` Linus Torvalds
2023-02-24 1:56 ` Andrew Morton
@ 2023-02-24 3:01 ` Huang, Ying
1 sibling, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2023-02-24 3:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jan Kara, Vishal Moola, Paulo Alcantara,
Matthew Wilcox, David Howells, Steve French, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-kernel
Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> - gcc 12.2.1 quite reasonable complains about some of the new MM code:
>
> mm/migrate.c: In function ‘__migrate_folio_extract’:
> mm/migrate.c:1050:20: note: randstruct: casting between randomized
> structure pointer types (ssa): ‘struct anon_vma’ and ‘struct
> address_space’
>
> 1050 | *anon_vmap = (void *)dst->mapping;
> | ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
>
> and while this doesn't cause a build failure ("note" is different
> from "warning"), I do think something needs to be done. Gcc is right.
> This code seems to *work* simply because it's intentionally
> mis-casting pointers,
Yes. The mis-casting is intentional. I just need some place to hold
the data temporarily (save in __migrate_folio_record() and clear in
__migrate_folio_extract()). And "dst" is newly allocated folio.
> but I think it needs to be seriously looked at and something done to
> make gcc happy (and a *LARGE* comment about it).
Sure. I will check whether there's some way to make gcc happy and add
some comments about that. There's some comments for
__migrate_folio_extract(), but that's isn't enough apprently.)
> That last note is not some merge result, it's purely about the new MM code.
>
[snip]
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
` (2 preceding siblings ...)
2023-02-24 1:33 ` Linus Torvalds
@ 2023-02-24 9:04 ` David Howells
2023-02-24 12:12 ` David Howells
` (4 subsequent siblings)
8 siblings, 0 replies; 33+ messages in thread
From: David Howells @ 2023-02-24 9:04 UTC (permalink / raw)
To: Linus Torvalds, Vishal Moola, Steve French
Cc: dhowells, Andrew Morton, Jan Kara, Paulo Alcantara,
Matthew Wilcox, Huang Ying, Baolin Wang, Xin Hao, linux-mm,
mm-commits, linux-kernel
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Yes, I saw David Howells resolution suggestion. I think that one
> was buggy. It would wait for a page under writeback, and then go on to
> the *next* one without writing it back. I don't thin kthat was right.
You're right. Vishal's patch introduced it into afs and I copied it across
and didn't notice it either then or on review of Vishal's patch. He inserted
the extra for-loop as he's now extracting a batch, but kept the continue that
used to repeat the extraction - except now it continues the wrong loop.
So afs will need fixing too. The simplest ways I think are to just decrement
the loop counter before continuing or to stick a goto in back to the beginning
of the loop (which is what you did in cifs). But I'm not sure that's the
correct thing to do. The previous code dropped the found folio and then
repeated the search in case the folio got truncated, migrated or punched. I
suspect that's probably what we should do.
Also, thinking about it again, I'm not sure whether fetching a batch with
filemap_get_folios_tag() like this in {afs,cifs}_writepages_region() is
necessarily the right thing to do. There are three cases I'm thinking of:
(1) A single folio is returned. This is trivial.
(2) A run of contiguous folios are returned - {afs,cifs}_extend_writeback()
is likely to write them back, in which case the batch is probably not
useful. Note that *_extend_writeback() walks the xarray directly itself
as it wants contiguous folios and doesn't want to extract any folio it's
not going to use.
(3) A list of scattered folios is returned. Granted this is more efficient
if nothing else interferes - but there could be other writes in the gaps
that we then skip over, other flushes that render some of our list clean
or page invalidations. This is a change in behaviour, but I'm not sure
that matters too much since a flush/sync can only be expected to write
back what's modified at the time it is initiated.
Further, processing each entry in the list is potentially very slow
because we're doing a write across the network for each one (cifs might
bump this into the background, but it might also have to (re)open a file
handle on the server and wait for credits first to even begin the
transaction).
Which means all of the folios in the batch may then get pinned for a long
period of time - up to 14x for the last folio in the batch - which could
prevent things like page migration.
Further, we might not get to write out all the folios in the batch as
*_extend_writeback() might hit the wbc limit first.
> That said, I'm not at all convinced my version is right either. I
> can't test it, and that means I probably messed up. It looked sane to
> me when I did it, and it builds cleanly, but I honestly doubt myself.
It doesn't seem to work. A write seems to end in lots of:
CIFS: VFS: No writable handle in writepages rc=-9
being emitted. I'll poke further into it - there's always the possibility
that some other patch is interfering.
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
` (3 preceding siblings ...)
2023-02-24 9:04 ` David Howells
@ 2023-02-24 12:12 ` David Howells
2023-02-24 14:31 ` [RFC][PATCH] cifs: Fix cifs_writepages_region() David Howells
` (3 subsequent siblings)
8 siblings, 0 replies; 33+ messages in thread
From: David Howells @ 2023-02-24 12:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Vishal Moola, Steve French, Andrew Morton, Jan Kara,
Paulo Alcantara, Matthew Wilcox, Huang Ying, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-kernel
The problem appears to be here:
if (folio_mapping(folio) != mapping ||
!folio_test_dirty(folio)) {
folio_unlock(folio);
goto skip_write;
}
in my version it's:
if (folio->mapping != mapping ||
!folio_test_dirty(folio)) {
start += folio_size(folio);
folio_unlock(folio);
continue;
}
In your version, skip_write doesn't advance start if it too many skips occur.
Changing your version to match fixes the problem - or, at least, the
symptoms. I'm not sure exactly why it's occurring, but the -EBADF (-9) is
coming from cifs_get_writable_file() not finding an open file. I think this
is a Steve question.
With Vishal's change to use filemap_get_folios_tag() instead of
find_get_pages_range_tag(), the most common file write case (open, write,
write, ..., write, close) in which all the data is added to the pagecache in
one contiguous lump without seeking, hits this a lot.
Unfortunately, unlike find_get_pages_range_tag(), filemap_get_folios_tag()
doesn't allow you to set a limit on the number of pages it will return.
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
` (4 preceding siblings ...)
2023-02-24 12:12 ` David Howells
@ 2023-02-24 14:31 ` David Howells
2023-02-24 16:06 ` Linus Torvalds
` (3 more replies)
2023-02-24 14:48 ` [RFC][PATCH] cifs, afs: Revert changes to {cifs,afs}_writepages_region() David Howells
` (2 subsequent siblings)
8 siblings, 4 replies; 33+ messages in thread
From: David Howells @ 2023-02-24 14:31 UTC (permalink / raw)
To: Linus Torvalds, Steve French
Cc: dhowells, Vishal Moola, Andrew Morton, Jan Kara, Paulo Alcantara,
Matthew Wilcox, Huang Ying, Baolin Wang, Xin Hao, linux-mm,
mm-commits, linux-kernel
Here's the simplest fix for cifs_writepages_region() that gets it to work.
Fix the cifs_writepages_region() to just skip over members of the batch that
have been cleaned up rather than retrying them. I'm not entirely sure why it
fixes it, though. It's also not the most efficient as, in the common case,
this is going to happen a lot because cifs_extend_writeback() is going to
clean up the contiguous pages in the batch - and then this skip will occur for
those.
Fix: 3822a7c40997 ("Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
Signed-off-by: David Howells <dhowells@redhat.com>
---
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 5365a3299088..ebfcaae8c437 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2893,8 +2893,9 @@ static int cifs_writepages_region(struct address_space *mapping,
if (folio_mapping(folio) != mapping ||
!folio_test_dirty(folio)) {
+ start += folio_size(folio);
folio_unlock(folio);
- goto skip_write;
+ continue;
}
if (folio_test_writeback(folio) ||
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 14:31 ` [RFC][PATCH] cifs: Fix cifs_writepages_region() David Howells
@ 2023-02-24 16:06 ` Linus Torvalds
2023-02-24 16:11 ` Matthew Wilcox
` (2 subsequent siblings)
3 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2023-02-24 16:06 UTC (permalink / raw)
To: David Howells
Cc: Steve French, Vishal Moola, Andrew Morton, Jan Kara,
Paulo Alcantara, Matthew Wilcox, Huang Ying, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1323 bytes --]
On Fri, Feb 24, 2023 at 6:31 AM David Howells <dhowells@redhat.com> wrote:
>
> Here's the simplest fix for cifs_writepages_region() that gets it to work.
Hmm. The commit message for this is wrong.
> Fix the cifs_writepages_region() to just skip over members of the batch that
> have been cleaned up rather than retrying them.
It never retried them. The "skip_write" code did that same
start += folio_size(folio);
continue;
that your patch does, but it *also* had that
if (skips >= 5 || need_resched()) {
thing to just stop writing entirely.
> I'm not entirely sure why it fixes it, though.
Yes. Strange. Because it does the exact same thing as the "Oh, the
trylock worked, but it was still under writeback or fscache" did. I
just merged all the "skip write" cases.
But the code is clearly (a) not working and (b) the whole skip count
and need_resched() logic is a bit strange to begin with.
Can you humor me, and try if just removing that skip count thing
instead? IOW, this attached patch? Because that whole "let's stop
writing if we need to reschedule" sounds truly odd (we have a
cond_resched(), although it's per folio batch, not per-folio), and the
skip count logic doesn't make much sense to me either.
SteveF?
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 875 bytes --]
fs/cifs/file.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 5365a3299088..7061d263315d 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2858,7 +2858,6 @@ static int cifs_writepages_region(struct address_space *mapping,
loff_t start, loff_t end, loff_t *_next)
{
struct folio_batch fbatch;
- int skips = 0;
folio_batch_init(&fbatch);
do {
@@ -2927,17 +2926,6 @@ static int cifs_writepages_region(struct address_space *mapping,
return ret;
skip_write:
- /*
- * Too many skipped writes, or need to reschedule?
- * Treat it as a write error without an error code.
- */
- if (skips >= 5 || need_resched()) {
- ret = 0;
- goto write_error;
- }
-
- /* Otherwise, just skip that folio and go on to the next */
- skips++;
start += folio_size(folio);
continue;
}
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 14:31 ` [RFC][PATCH] cifs: Fix cifs_writepages_region() David Howells
2023-02-24 16:06 ` Linus Torvalds
@ 2023-02-24 16:11 ` Matthew Wilcox
2023-02-24 17:15 ` David Howells
2023-02-24 17:19 ` David Howells
3 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-02-24 16:11 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, Steve French, Vishal Moola, Andrew Morton,
Jan Kara, Paulo Alcantara, Huang Ying, Baolin Wang, Xin Hao,
linux-mm, mm-commits, linux-kernel
On Fri, Feb 24, 2023 at 02:31:15PM +0000, David Howells wrote:
> Here's the simplest fix for cifs_writepages_region() that gets it to work.
>
> Fix the cifs_writepages_region() to just skip over members of the batch that
> have been cleaned up rather than retrying them. I'm not entirely sure why it
> fixes it, though. It's also not the most efficient as, in the common case,
> this is going to happen a lot because cifs_extend_writeback() is going to
> clean up the contiguous pages in the batch - and then this skip will occur for
> those.
Why are you doing it this way? What's wrong with using
write_cache_pages() to push all the contiguous dirty folios into a single
I/O object, submitting it when the folios turn out not to be contiguous,
or when we run out of a batch?
You've written an awful lot of code here and it's a different model from
every other filesystem. Why is it better?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 14:31 ` [RFC][PATCH] cifs: Fix cifs_writepages_region() David Howells
2023-02-24 16:06 ` Linus Torvalds
2023-02-24 16:11 ` Matthew Wilcox
@ 2023-02-24 17:15 ` David Howells
2023-02-24 18:44 ` Matthew Wilcox
2023-02-24 20:13 ` David Howells
2023-02-24 17:19 ` David Howells
3 siblings, 2 replies; 33+ messages in thread
From: David Howells @ 2023-02-24 17:15 UTC (permalink / raw)
To: Matthew Wilcox
Cc: dhowells, Linus Torvalds, Steve French, Vishal Moola,
Andrew Morton, Jan Kara, Paulo Alcantara, Huang Ying,
Baolin Wang, Xin Hao, linux-mm, mm-commits, linux-kernel
Matthew Wilcox <willy@infradead.org> wrote:
> Why are you doing it this way? What's wrong with using
> write_cache_pages() to push all the contiguous dirty folios into a single
> I/O object, submitting it when the folios turn out not to be contiguous,
> or when we run out of a batch?
>
> You've written an awful lot of code here and it's a different model from
> every other filesystem. Why is it better?
Because write_cache_pages():
(1) Takes no account of fscache. I can't just build knowledge of PG_fscache
into it because PG_private_2 may be used differently by other filesystems
(btrfs, for example). (I'm also trying to phase out the use of
PG_private_2 and instead uses PG_writeback to cover both and the
difference will be recorded elsewhere - but that's not there yet).
(2) Calls ->writepage() individually for each folio - which is excessive. In
AFS's implementation, we locate the first folio, then race through the
following folios without ever waiting until we hit something that's
locked or a gap and then stop and submit.
write_cache_pages(), otoh, calls us with the next folio already undirtied
and set for writeback when we find out that we don't want it yet.
(3) Skips over holes, but at some point in the future we're going to need to
schedule adjacent clean pages (before and after) for writeback too to
handle transport compression and fscache updates if the granule size for
either is larger than the folio size.
It might be better to take what's in cifs, generalise it and replace
write_cache_pages() with it, then have a "->submit_write()" aop that takes an
ITER_XARRAY iterator to write from.
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 17:15 ` David Howells
@ 2023-02-24 18:44 ` Matthew Wilcox
2023-02-24 20:13 ` David Howells
1 sibling, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-02-24 18:44 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, Steve French, Vishal Moola, Andrew Morton,
Jan Kara, Paulo Alcantara, Huang Ying, Baolin Wang, Xin Hao,
linux-mm, mm-commits, linux-kernel
On Fri, Feb 24, 2023 at 05:15:41PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > Why are you doing it this way? What's wrong with using
> > write_cache_pages() to push all the contiguous dirty folios into a single
> > I/O object, submitting it when the folios turn out not to be contiguous,
> > or when we run out of a batch?
> >
> > You've written an awful lot of code here and it's a different model from
> > every other filesystem. Why is it better?
>
> Because write_cache_pages():
>
> (1) Takes no account of fscache. I can't just build knowledge of PG_fscache
> into it because PG_private_2 may be used differently by other filesystems
> (btrfs, for example). (I'm also trying to phase out the use of
> PG_private_2 and instead uses PG_writeback to cover both and the
> difference will be recorded elsewhere - but that's not there yet).
I don't understand why waiting for fscache here is even the right thing
to do. The folio is dirty and needs to be written back to the
server. Why should beginning that write wait for the current write
to the fscache to complete?
> (2) Calls ->writepage() individually for each folio - which is excessive. In
> AFS's implementation, we locate the first folio, then race through the
> following folios without ever waiting until we hit something that's
> locked or a gap and then stop and submit.
>
> write_cache_pages(), otoh, calls us with the next folio already undirtied
> and set for writeback when we find out that we don't want it yet.
I think you're so convinced that your way is better that you don't see
what write_cache_pages() is actually doing. Yes, the writepage callback
is called once for each folio, but that doesn't actually submit a write.
Instead, they're accumulated into the equivalent of a wdata and the
wdata is submitted when there's a discontiguity or we've accumulated
enough to satisfy the constraints of the wbc.
> (3) Skips over holes, but at some point in the future we're going to need to
> schedule adjacent clean pages (before and after) for writeback too to
> handle transport compression and fscache updates if the granule size for
> either is larger than the folio size.
Then we'll need it for other filesystems too, so should enhance
write_cache_pages().
> It might be better to take what's in cifs, generalise it and replace
> write_cache_pages() with it, then have a "->submit_write()" aop that takes an
> ITER_XARRAY iterator to write from.
->writepages _is_ ->submit_write. Should write_cache_pages() be better?
Maybe! But it works a whole lot better than what AFS was doing and
you've now ladelled into cifs.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 17:15 ` David Howells
2023-02-24 18:44 ` Matthew Wilcox
@ 2023-02-24 20:13 ` David Howells
2023-02-24 20:16 ` Linus Torvalds
2023-02-24 20:58 ` David Howells
1 sibling, 2 replies; 33+ messages in thread
From: David Howells @ 2023-02-24 20:13 UTC (permalink / raw)
To: Matthew Wilcox
Cc: dhowells, Linus Torvalds, Steve French, Vishal Moola,
Andrew Morton, Jan Kara, Paulo Alcantara, Huang Ying,
Baolin Wang, Xin Hao, linux-mm, mm-commits, linux-kernel
Matthew Wilcox <willy@infradead.org> wrote:
> I don't understand why waiting for fscache here is even the right thing
> to do.
Then why do we have to wait for PG_writeback to complete?
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 20:13 ` David Howells
@ 2023-02-24 20:16 ` Linus Torvalds
2023-02-24 20:45 ` Matthew Wilcox
2023-02-27 13:20 ` David Howells
2023-02-24 20:58 ` David Howells
1 sibling, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2023-02-24 20:16 UTC (permalink / raw)
To: David Howells
Cc: Matthew Wilcox, Steve French, Vishal Moola, Andrew Morton,
Jan Kara, Paulo Alcantara, Huang Ying, Baolin Wang, Xin Hao,
linux-mm, mm-commits, linux-kernel
On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
>
> Then why do we have to wait for PG_writeback to complete?
At least for PG_writeback, it's about "the _previous_ dirty write is
still under way, but - since PG_dirty is set again - the page has been
dirtied since".
So we have to start _another_ writeback, because while the current
writeback *might* have written the updated data, that is not at all
certain or clear.
I'm not sure what the fscache rules are.
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 20:16 ` Linus Torvalds
@ 2023-02-24 20:45 ` Matthew Wilcox
2023-02-27 13:20 ` David Howells
1 sibling, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-02-24 20:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, Steve French, Vishal Moola, Andrew Morton,
Jan Kara, Paulo Alcantara, Huang Ying, Baolin Wang, Xin Hao,
linux-mm, mm-commits, linux-kernel
On Fri, Feb 24, 2023 at 12:16:49PM -0800, Linus Torvalds wrote:
> On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Then why do we have to wait for PG_writeback to complete?
>
> At least for PG_writeback, it's about "the _previous_ dirty write is
> still under way, but - since PG_dirty is set again - the page has been
> dirtied since".
>
> So we have to start _another_ writeback, because while the current
> writeback *might* have written the updated data, that is not at all
> certain or clear.
also, we only have a writeback bit, not a writeback count. And when
the current writeback completes, it'll clear that bit. We're also
being kind to our backing store and not writing to the same block twice
at the same time.
> I'm not sure what the fscache rules are.
My understanding is that the fscache bit is set under several
circumstances, but if the folio is dirty _and_ the fscache bit
is set, it means the folio is currently being written to the cache
device. I don't see a conflict there; we can write to the backing
store and the cache device at the same time.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 20:16 ` Linus Torvalds
2023-02-24 20:45 ` Matthew Wilcox
@ 2023-02-27 13:20 ` David Howells
1 sibling, 0 replies; 33+ messages in thread
From: David Howells @ 2023-02-27 13:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: dhowells, Linus Torvalds, Steve French, Vishal Moola,
Andrew Morton, Jan Kara, Paulo Alcantara, Huang Ying,
Baolin Wang, Xin Hao, linux-mm, mm-commits, linux-kernel
Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Feb 24, 2023 at 12:16:49PM -0800, Linus Torvalds wrote:
> > On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
> > >
> > > Then why do we have to wait for PG_writeback to complete?
> >
> > At least for PG_writeback, it's about "the _previous_ dirty write is
> > still under way, but - since PG_dirty is set again - the page has been
> > dirtied since".
> >
> > So we have to start _another_ writeback, because while the current
> > writeback *might* have written the updated data, that is not at all
> > certain or clear.
>
> also, we only have a writeback bit, not a writeback count. And when
> the current writeback completes, it'll clear that bit. We're also
> being kind to our backing store and not writing to the same block twice
> at the same time.
It's not so much being kind to the backing store, I think, as avoiding the
possibility that the writes happen out of order.
> > I'm not sure what the fscache rules are.
>
> My understanding is that the fscache bit is set under several
> circumstances, but if the folio is dirty _and_ the fscache bit
> is set, it means the folio is currently being written to the cache
> device. I don't see a conflict there; we can write to the backing
> store and the cache device at the same time.
The dirty bit is nothing to do with it. If the fscache bit is set, then the
page is currently being written to the cache - and we need to wait before
starting another write.
Sometimes we start a write to the cache from a clean page (e.g. we just read
it from the server) and sometimes we start a write to the cache from
writepages (e.g. the data is dirty and we're writing it to the server as
well).
Things will become more 'interesting' should we ever get around to
implementing disconnected operation. Then we might end up staging dirty data
through the cache.
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 20:13 ` David Howells
2023-02-24 20:16 ` Linus Torvalds
@ 2023-02-24 20:58 ` David Howells
1 sibling, 0 replies; 33+ messages in thread
From: David Howells @ 2023-02-24 20:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Matthew Wilcox, Steve French, Vishal Moola,
Andrew Morton, Jan Kara, Paulo Alcantara, Huang Ying,
Baolin Wang, Xin Hao, linux-mm, mm-commits, linux-kernel
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Then why do we have to wait for PG_writeback to complete?
>
> At least for PG_writeback, it's about "the _previous_ dirty write is
> still under way, but - since PG_dirty is set again - the page has been
> dirtied since".
>
> So we have to start _another_ writeback, because while the current
> writeback *might* have written the updated data, that is not at all
> certain or clear.
As I understand it, it's also about serialising writes from the same page to
the same backing store. We don't want them to end up out-of-order. I'm not
sure what guarantees, for instance, the block layer gives if two I/O requests
go to the same place.
> I'm not sure what the fscache rules are.
I'm now using PG_fscache in exactly the same way: the previous write to the
cache is still under way. I don't want to start another DIO write to the
cache for the same pages.
Hence the waits/checks on PG_fscache I've added anywhere we need to wait/check
on PG_writeback.
As I mentioned I'm looking at the possibility of making PG_dirty and
PG_writeback cover *both* cases and recording the difference elsewhere -
thereby returning PG_private_2 to the VM folks who'd like their bit back.
This means, for instance, when we read from the server and find we need to
write it to the cache, we set a note in the aforementioned elsewhere, mark the
page dirty and leave it to writepages() to effect the write to the cache.
It could get tricky because we have two different places to write to, with
very different characteristics (e.g. ~6000km away server vs local SSD) with
their own queueing, scheduling, bandwidth, etc. - and the local disk might
have to share with the system.
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 14:31 ` [RFC][PATCH] cifs: Fix cifs_writepages_region() David Howells
` (2 preceding siblings ...)
2023-02-24 17:15 ` David Howells
@ 2023-02-24 17:19 ` David Howells
2023-02-24 18:58 ` Linus Torvalds
3 siblings, 1 reply; 33+ messages in thread
From: David Howells @ 2023-02-24 17:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Steve French, Vishal Moola, Andrew Morton, Jan Kara,
Paulo Alcantara, Matthew Wilcox, Huang Ying, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-kernel
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Can you humor me, and try if just removing that skip count thing
> instead? IOW, this attached patch?
That works too.
> Because that whole "let's stop writing if we need to reschedule" sounds
> truly odd (we have a cond_resched(), although it's per folio batch, not
> per-folio), and the skip count logic doesn't make much sense to me either.
The skip thing, in my code, is only used in WB_SYNC_NONE mode. If we hit 5
things in progress or rescheduling is required, we return to the caller on the
basis that conflicting flushes appear to be happening in other threads.
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 17:19 ` David Howells
@ 2023-02-24 18:58 ` Linus Torvalds
2023-02-24 19:05 ` Linus Torvalds
0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2023-02-24 18:58 UTC (permalink / raw)
To: David Howells
Cc: Steve French, Vishal Moola, Andrew Morton, Jan Kara,
Paulo Alcantara, Matthew Wilcox, Huang Ying, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-kernel
On Fri, Feb 24, 2023 at 9:19 AM David Howells <dhowells@redhat.com> wrote:
>
> The skip thing, in my code, is only used in WB_SYNC_NONE mode. If we hit 5
> things in progress or rescheduling is required, we return to the caller on the
> basis that conflicting flushes appear to be happening in other threads.
Ahh. *That* is the difference, and I didn't realize.
I made all the skip-write cases the same, and I really meant for that
case to only trigger for WB_SYNC_NONE, but I stupidly didn't notice
that the whole folio_test_dirty() re-test case was done without that
WB_SYNC_NONE case that all the other cases had.
Mea culpa, mea maxima culpa. That was just me being stupid.
So that case isn't actually a "skip write" case at all, it's actually
a "no write needed at all" case.
Your original patch is the right fix, and I was just being silly for
having not realized.
I'll apply that minimal fix for now - I think the right thing to do is
your bigger patch, but that needs more thinking (or at least splitting
up).
Sorry for the confusion, and thanks for setting me straight,
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 18:58 ` Linus Torvalds
@ 2023-02-24 19:05 ` Linus Torvalds
2023-03-01 18:32 ` [EXTERNAL] " Steven French
0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2023-02-24 19:05 UTC (permalink / raw)
To: David Howells
Cc: Steve French, Vishal Moola, Andrew Morton, Jan Kara,
Paulo Alcantara, Matthew Wilcox, Huang Ying, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-kernel
On Fri, Feb 24, 2023 at 10:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll apply that minimal fix for now - I think the right thing to do is
> your bigger patch, but that needs more thinking (or at least splitting
> up).
Minimal fix applied - that way I can drop this for now, and we can
discuss the whole "maybe we can just use write_cache_pages()" instead.
Because that _would_ be lovely, even if it's possible that the generic
helper might need some extra love to work better for cifs/afs.
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [EXTERNAL] Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
2023-02-24 19:05 ` Linus Torvalds
@ 2023-03-01 18:32 ` Steven French
0 siblings, 0 replies; 33+ messages in thread
From: Steven French @ 2023-03-01 18:32 UTC (permalink / raw)
To: Linus Torvalds, David Howells
Cc: Vishal Moola, Andrew Morton, Jan Kara, pc, Matthew Wilcox,
Huang Ying, Baolin Wang, Xin Hao, linux-mm, mm-commits,
linux-kernel, Steve French
Have been doing additional testing of these - and also verified that David's most recent patch, now in my for-next branch ("cifs: Fix memory leak in direct I/O") fixes the remaining problem (the issue with xfstest 208 that Murphy pointed out). Will send the three small cifs fixes from David later this week, along with some unrelated fixes from Paulo and Shyam.
-----Original Message-----
From: Linus Torvalds <torvalds@linux-foundation.org>
Sent: Friday, February 24, 2023 1:06 PM
To: David Howells <dhowells@redhat.com>
Cc: Steven French <Steven.French@microsoft.com>; Vishal Moola <vishal.moola@gmail.com>; Andrew Morton <akpm@linux-foundation.org>; Jan Kara <jack@suse.cz>; pc <pc@cjr.nz>; Matthew Wilcox <willy@infradead.org>; Huang Ying <ying.huang@intel.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Xin Hao <xhao@linux.alibaba.com>; linux-mm@kvack.org; mm-commits@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: [EXTERNAL] Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()
[You don't often get email from torvalds@linux-foundation.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
On Fri, Feb 24, 2023 at 10:58 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I'll apply that minimal fix for now - I think the right thing to do is
> your bigger patch, but that needs more thinking (or at least splitting
> up).
Minimal fix applied - that way I can drop this for now, and we can discuss the whole "maybe we can just use write_cache_pages()" instead.
Because that _would_ be lovely, even if it's possible that the generic helper might need some extra love to work better for cifs/afs.
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH] cifs, afs: Revert changes to {cifs,afs}_writepages_region()
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
` (5 preceding siblings ...)
2023-02-24 14:31 ` [RFC][PATCH] cifs: Fix cifs_writepages_region() David Howells
@ 2023-02-24 14:48 ` David Howells
2023-02-24 16:13 ` Linus Torvalds
2023-02-24 15:13 ` [RFC][PATCH] cifs: Improve use of filemap_get_folios_tag() David Howells
2023-02-26 2:43 ` [GIT PULL] MM updates for 6.3-rc1 Linus Torvalds
8 siblings, 1 reply; 33+ messages in thread
From: David Howells @ 2023-02-24 14:48 UTC (permalink / raw)
To: Linus Torvalds, Steve French
Cc: dhowells, Vishal Moola, Andrew Morton, Jan Kara, Paulo Alcantara,
Matthew Wilcox, Huang Ying, Baolin Wang, Xin Hao, linux-mm,
mm-commits, linux-kernel
Here's a more complex patch that reverts Vishal's patch to afs and your
changes to cifs back to the point where find_get_pages_range_tag() was
being used to get a single folio and then replace that with a function,
filemap_get_folio_tag() that just gets a single folio. An alternative way
of doing this would be to make filemap_get_folios_tag() take a limit count.
This is likely to be more efficient for the common case as
*_extend_writeback() will deal with pages that are contiguous to the
starting page before we get on to continuing to process the batch.
For filemap_get_folios_tag() to be of use, the batch has to be passed down,
and if it contains scattered, non-contiguous pages, these are likely to end
up being pinned by the batch for significant periods of time whilst I/O is
undertaken on earlier pages.
Fix: 3822a7c40997 ("Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
Fix: acc8d8588cb7 ("afs: convert afs_writepages_region() to use filemap_get_folios_tag()")
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/afs/write.c | 116 +++++++++++++++++++++++-------------------------
fs/cifs/file.c | 114 ++++++++++++++++++++---------------------------
include/linux/pagemap.h | 2
mm/filemap.c | 58 ++++++++++++++++++++++++
4 files changed, 165 insertions(+), 125 deletions(-)
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 571f3b9a417e..b04a95262c4f 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -704,87 +704,83 @@ static int afs_writepages_region(struct address_space *mapping,
bool max_one_loop)
{
struct folio *folio;
- struct folio_batch fbatch;
ssize_t ret;
- unsigned int i;
int n, skips = 0;
_enter("%llx,%llx,", start, end);
- folio_batch_init(&fbatch);
do {
pgoff_t index = start / PAGE_SIZE;
- n = filemap_get_folios_tag(mapping, &index, end / PAGE_SIZE,
- PAGECACHE_TAG_DIRTY, &fbatch);
-
- if (!n)
+ folio = filemap_get_folio_tag(mapping, &index, end / PAGE_SIZE,
+ PAGECACHE_TAG_DIRTY);
+ if (!folio)
break;
- for (i = 0; i < n; i++) {
- folio = fbatch.folios[i];
- start = folio_pos(folio); /* May regress with THPs */
- _debug("wback %lx", folio_index(folio));
+ start = folio_pos(folio); /* May regress with THPs */
- /* At this point we hold neither the i_pages lock nor the
- * page lock: the page may be truncated or invalidated
- * (changing page->mapping to NULL), or even swizzled
- * back from swapper_space to tmpfs file mapping
- */
- if (wbc->sync_mode != WB_SYNC_NONE) {
- ret = folio_lock_killable(folio);
- if (ret < 0) {
- folio_batch_release(&fbatch);
- return ret;
- }
- } else {
- if (!folio_trylock(folio))
- continue;
- }
+ _debug("wback %lx", folio_index(folio));
- if (folio->mapping != mapping ||
- !folio_test_dirty(folio)) {
- start += folio_size(folio);
- folio_unlock(folio);
- continue;
+ /* At this point we hold neither the i_pages lock nor the
+ * page lock: the page may be truncated or invalidated
+ * (changing page->mapping to NULL), or even swizzled
+ * back from swapper_space to tmpfs file mapping
+ */
+ if (wbc->sync_mode != WB_SYNC_NONE) {
+ ret = folio_lock_killable(folio);
+ if (ret < 0) {
+ folio_put(folio);
+ return ret;
+ }
+ } else {
+ if (!folio_trylock(folio)) {
+ folio_put(folio);
+ return 0;
}
+ }
- if (folio_test_writeback(folio) ||
- folio_test_fscache(folio)) {
- folio_unlock(folio);
- if (wbc->sync_mode != WB_SYNC_NONE) {
- folio_wait_writeback(folio);
+ if (folio_mapping(folio) != mapping ||
+ !folio_test_dirty(folio)) {
+ start += folio_size(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ continue;
+ }
+
+ if (folio_test_writeback(folio) ||
+ folio_test_fscache(folio)) {
+ folio_unlock(folio);
+ if (wbc->sync_mode != WB_SYNC_NONE) {
+ folio_wait_writeback(folio);
#ifdef CONFIG_AFS_FSCACHE
- folio_wait_fscache(folio);
+ folio_wait_fscache(folio);
#endif
- } else {
- start += folio_size(folio);
- }
- if (wbc->sync_mode == WB_SYNC_NONE) {
- if (skips >= 5 || need_resched()) {
- *_next = start;
- _leave(" = 0 [%llx]", *_next);
- return 0;
- }
- skips++;
- }
- continue;
+ } else {
+ start += folio_size(folio);
}
-
- if (!folio_clear_dirty_for_io(folio))
- BUG();
- ret = afs_write_back_from_locked_folio(mapping, wbc,
- folio, start, end);
- if (ret < 0) {
- _leave(" = %zd", ret);
- folio_batch_release(&fbatch);
- return ret;
+ folio_put(folio);
+ if (wbc->sync_mode == WB_SYNC_NONE) {
+ if (skips >= 5 || need_resched())
+ break;
+ skips++;
}
+ continue;
+ }
- start += ret;
+ if (!folio_clear_dirty_for_io(folio))
+ BUG();
+ ret = afs_write_back_from_locked_folio(mapping, wbc, folio, start, end);
+ folio_put(folio);
+ if (ret < 0) {
+ _leave(" = %zd", ret);
+ return ret;
}
- folio_batch_release(&fbatch);
+ start += ret;
+
+ if (max_one_loop)
+ break;
+
cond_resched();
} while (wbc->nr_to_write > 0);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 5365a3299088..121254086e30 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2857,92 +2857,76 @@ static int cifs_writepages_region(struct address_space *mapping,
struct writeback_control *wbc,
loff_t start, loff_t end, loff_t *_next)
{
- struct folio_batch fbatch;
+ struct folio *folio;
+ ssize_t ret;
int skips = 0;
- folio_batch_init(&fbatch);
do {
- int nr;
pgoff_t index = start / PAGE_SIZE;
- nr = filemap_get_folios_tag(mapping, &index, end / PAGE_SIZE,
- PAGECACHE_TAG_DIRTY, &fbatch);
- if (!nr)
+ folio = filemap_get_folio_tag(mapping, &index, end / PAGE_SIZE,
+ PAGECACHE_TAG_DIRTY);
+ if (!folio)
break;
- for (int i = 0; i < nr; i++) {
- ssize_t ret;
- struct folio *folio = fbatch.folios[i];
-
-redo_folio:
- start = folio_pos(folio); /* May regress with THPs */
+ start = folio_pos(folio); /* May regress with THPs */
- /* At this point we hold neither the i_pages lock nor the
- * page lock: the page may be truncated or invalidated
- * (changing page->mapping to NULL), or even swizzled
- * back from swapper_space to tmpfs file mapping
- */
- if (wbc->sync_mode != WB_SYNC_NONE) {
- ret = folio_lock_killable(folio);
- if (ret < 0)
- goto write_error;
- } else {
- if (!folio_trylock(folio))
- goto skip_write;
+ /* At this point we hold neither the i_pages lock nor the
+ * page lock: the page may be truncated or invalidated
+ * (changing page->mapping to NULL), or even swizzled
+ * back from swapper_space to tmpfs file mapping
+ */
+ if (wbc->sync_mode != WB_SYNC_NONE) {
+ ret = folio_lock_killable(folio);
+ if (ret < 0) {
+ folio_put(folio);
+ return ret;
}
-
- if (folio_mapping(folio) != mapping ||
- !folio_test_dirty(folio)) {
- folio_unlock(folio);
- goto skip_write;
+ } else {
+ if (!folio_trylock(folio)) {
+ folio_put(folio);
+ return 0;
}
+ }
- if (folio_test_writeback(folio) ||
- folio_test_fscache(folio)) {
- folio_unlock(folio);
- if (wbc->sync_mode == WB_SYNC_NONE)
- goto skip_write;
+ if (folio_mapping(folio) != mapping ||
+ !folio_test_dirty(folio)) {
+ start += folio_size(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ continue;
+ }
+ if (folio_test_writeback(folio) ||
+ folio_test_fscache(folio)) {
+ folio_unlock(folio);
+ if (wbc->sync_mode != WB_SYNC_NONE) {
folio_wait_writeback(folio);
#ifdef CONFIG_CIFS_FSCACHE
folio_wait_fscache(folio);
#endif
- goto redo_folio;
+ } else {
+ start += folio_size(folio);
}
-
- if (!folio_clear_dirty_for_io(folio))
- /* We hold the page lock - it should've been dirty. */
- WARN_ON(1);
-
- ret = cifs_write_back_from_locked_folio(mapping, wbc, folio, start, end);
- if (ret < 0)
- goto write_error;
-
- start += ret;
- continue;
-
-write_error:
- folio_batch_release(&fbatch);
- *_next = start;
- return ret;
-
-skip_write:
- /*
- * Too many skipped writes, or need to reschedule?
- * Treat it as a write error without an error code.
- */
- if (skips >= 5 || need_resched()) {
- ret = 0;
- goto write_error;
+ folio_put(folio);
+ if (wbc->sync_mode == WB_SYNC_NONE) {
+ if (skips >= 5 || need_resched())
+ break;
+ skips++;
}
-
- /* Otherwise, just skip that folio and go on to the next */
- skips++;
- start += folio_size(folio);
continue;
}
- folio_batch_release(&fbatch);
+ if (!folio_clear_dirty_for_io(folio))
+ /* We hold the page lock - it should've been dirty. */
+ WARN_ON(1);
+
+ ret = cifs_write_back_from_locked_folio(mapping, wbc, folio, start, end);
+ folio_put(folio);
+ if (ret < 0)
+ return ret;
+
+ start += ret;
cond_resched();
} while (wbc->nr_to_write > 0);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0acb8e1fb7af..577535633006 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -741,6 +741,8 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch);
+struct folio *filemap_get_folio_tag(struct address_space *mapping, pgoff_t *start,
+ pgoff_t end, xa_mark_t tag);
struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2723104cc06a..1b1e9c661018 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2339,6 +2339,64 @@ unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
}
EXPORT_SYMBOL(filemap_get_folios_tag);
+/**
+ * filemap_get_folio_tag - Get the first folio matching @tag
+ * @mapping: The address_space to search
+ * @start: The starting page index
+ * @end: The final page index (inclusive)
+ * @tag: The tag index
+ *
+ * Search for and return the first folios in the mapping starting at index
+ * @start and up to index @end (inclusive). The folio is returned with an
+ * elevated reference count.
+ *
+ * If a folio is returned, it may start before @start; if it does, it will
+ * contain @start. The folio may also extend beyond @end; if it does, it will
+ * contain @end. If folios are added to or removed from the page cache while
+ * this is running, they may or may not be found by this call.
+ *
+ * Return: The folio that was found or NULL. @start is also updated to index
+ * the next folio for the traversal or will be left pointing after @end.
+ */
+struct folio *filemap_get_folio_tag(struct address_space *mapping, pgoff_t *start,
+ pgoff_t end, xa_mark_t tag)
+{
+ XA_STATE(xas, &mapping->i_pages, *start);
+ struct folio *folio;
+
+ rcu_read_lock();
+ while ((folio = find_get_entry(&xas, end, tag)) != NULL) {
+ /*
+ * Shadow entries should never be tagged, but this iteration
+ * is lockless so there is a window for page reclaim to evict
+ * a page we saw tagged. Skip over it.
+ */
+ if (xa_is_value(folio))
+ continue;
+
+ if (folio_test_hugetlb(folio))
+ *start = folio->index + 1;
+ else
+ *start = folio_next_index(folio);
+ goto out;
+ }
+
+ /*
+ * We come here when there is no page beyond @end. We take care to not
+ * overflow the index @start as it confuses some of the callers. This
+ * breaks the iteration when there is a page at index -1 but that is
+ * already broke anyway.
+ */
+ if (end == (pgoff_t)-1)
+ *start = (pgoff_t)-1;
+ else
+ *start = end + 1;
+out:
+ rcu_read_unlock();
+ return folio;
+}
+EXPORT_SYMBOL(filemap_get_folio_tag);
+
/*
* CD/DVDs are error prone. When a medium error occurs, the driver may fail
* a _large_ part of the i/o request. Imagine the worst scenario:
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs, afs: Revert changes to {cifs,afs}_writepages_region()
2023-02-24 14:48 ` [RFC][PATCH] cifs, afs: Revert changes to {cifs,afs}_writepages_region() David Howells
@ 2023-02-24 16:13 ` Linus Torvalds
0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2023-02-24 16:13 UTC (permalink / raw)
To: David Howells
Cc: Steve French, Vishal Moola, Andrew Morton, Jan Kara,
Paulo Alcantara, Matthew Wilcox, Huang Ying, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-kernel
On Fri, Feb 24, 2023 at 6:48 AM David Howells <dhowells@redhat.com> wrote:
>
> Here's a more complex patch that reverts Vishal's patch to afs and your
> changes to cifs back to the point where find_get_pages_range_tag() was
> being used to get a single folio and then replace that with a function,
> filemap_get_folio_tag() that just gets a single folio. An alternative way
> of doing this would be to make filemap_get_folios_tag() take a limit count.
Ack. I think this is the right thing to do.
Except we should at a minimum split this into two patches, with the
first one introducing that filemap_get_folio_tag() helper function
that makes the whole find_get_pages_range_tag() conversion much
simpler.
In fact, probably three patches - infrastructure, cifs and afs
separately. Just to make each patch simpler to look at for people who
care about that particular area.
And I'd still like to know why that 'skips' logic exists, it seems to
be shared with afs. The fact that it actually seems to change
semantics by your testing makes me even more suspicious of it than I
was when I was doing that "skip_write" thing.
(But the change I did was to treat _all_ the skipped writes the same,
while the old - and your revert - behavior is to only do the skip
counting for the "already under writeback" case. But it still stinks
to me).
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH] cifs: Improve use of filemap_get_folios_tag()
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
` (6 preceding siblings ...)
2023-02-24 14:48 ` [RFC][PATCH] cifs, afs: Revert changes to {cifs,afs}_writepages_region() David Howells
@ 2023-02-24 15:13 ` David Howells
2023-02-24 16:22 ` Linus Torvalds
2023-02-24 17:22 ` David Howells
2023-02-26 2:43 ` [GIT PULL] MM updates for 6.3-rc1 Linus Torvalds
8 siblings, 2 replies; 33+ messages in thread
From: David Howells @ 2023-02-24 15:13 UTC (permalink / raw)
To: Linus Torvalds, Steve French
Cc: dhowells, Vishal Moola, Andrew Morton, Jan Kara, Paulo Alcantara,
Matthew Wilcox, Huang Ying, Baolin Wang, Xin Hao, linux-mm,
mm-commits, linux-cifs, linux-kernel
[This additional to the "cifs: Fix cifs_writepages_region()" patch that I
posted]
The inefficiency derived from filemap_get_folios_tag() get a batch of
contiguous folios in Vishal's change to afs that got copied into cifs can
be reduced by skipping over those folios that have been passed by the start
position rather than going through the process of locking, checking and
trying to write them.
A similar change would need to be made in afs, in addition to fixing the bugs
there.
There's also a fix in cifs_write_back_from_locked_folio() where it doesn't
return the amount of data dispatched to the server as ->async_writev() just
returns 0 on success.
Signed-off-by: David Howells <dhowells@redhat.com>
---
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index ebfcaae8c437..bae1a9709e32 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2839,6 +2839,7 @@ static ssize_t cifs_write_back_from_locked_folio(struct address_space *mapping,
free_xid(xid);
if (rc == 0) {
wbc->nr_to_write = count;
+ rc = len;
} else if (is_retryable_error(rc)) {
cifs_pages_write_redirty(inode, start, len);
} else {
@@ -2873,6 +2874,13 @@ static int cifs_writepages_region(struct address_space *mapping,
for (int i = 0; i < nr; i++) {
ssize_t ret;
struct folio *folio = fbatch.folios[i];
+ unsigned long long fstart;
+
+ fstart = folio_pos(folio); /* May go backwards with THPs */
+ if (fstart < start &&
+ folio_size(folio) <= start - fstart)
+ continue;
+ start = fstart;
redo_folio:
start = folio_pos(folio); /* May regress with THPs */
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Improve use of filemap_get_folios_tag()
2023-02-24 15:13 ` [RFC][PATCH] cifs: Improve use of filemap_get_folios_tag() David Howells
@ 2023-02-24 16:22 ` Linus Torvalds
2023-02-24 17:22 ` David Howells
1 sibling, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2023-02-24 16:22 UTC (permalink / raw)
To: David Howells
Cc: Steve French, Vishal Moola, Andrew Morton, Jan Kara,
Paulo Alcantara, Matthew Wilcox, Huang Ying, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-cifs, linux-kernel
On Fri, Feb 24, 2023 at 7:13 AM David Howells <dhowells@redhat.com> wrote:
>
> The inefficiency derived from filemap_get_folios_tag() get a batch of
> contiguous folios in Vishal's change to afs that got copied into cifs can
> be reduced by skipping over those folios that have been passed by the start
> position rather than going through the process of locking, checking and
> trying to write them.
This patch just makes me go "Ugh".
There's something wrong with this code for it to need these games.
That just makes me convinced that your other patch that just gets rid
of the batching entirely is the right one.
Of course, I'd be even happier if Willy is right and the code could
use the generic write_cache_pages() and avoid all of these things
entirely. I'm not clear on why cifs and afs are being so different in
the first place, and some of the differences are just odd (like that
skip count).
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH] cifs: Improve use of filemap_get_folios_tag()
2023-02-24 15:13 ` [RFC][PATCH] cifs: Improve use of filemap_get_folios_tag() David Howells
2023-02-24 16:22 ` Linus Torvalds
@ 2023-02-24 17:22 ` David Howells
1 sibling, 0 replies; 33+ messages in thread
From: David Howells @ 2023-02-24 17:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Steve French, Vishal Moola, Andrew Morton, Jan Kara,
Paulo Alcantara, Matthew Wilcox, Huang Ying, Baolin Wang,
Xin Hao, linux-mm, mm-commits, linux-cifs, linux-kernel
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Of course, I'd be even happier if Willy is right and the code could
> use the generic write_cache_pages() and avoid all of these things
> entirely. I'm not clear on why cifs and afs are being so different in
> the first place, and some of the differences are just odd (like that
> skip count).
The main reason is that write_cache_pages() doesn't (and can't) check
PG_fscache (btrfs uses PG_private_2 for other purposes). NFS, 9p and ceph,
for the moment, don't cache files that are open for writing, but I'm intending
to change that at some point. The intention is to unify the writepages code
for at least 9p, afs, ceph and cifs in netfslib in the future.
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-20 21:52 [GIT PULL] MM updates for 6.3-rc1 Andrew Morton
` (7 preceding siblings ...)
2023-02-24 15:13 ` [RFC][PATCH] cifs: Improve use of filemap_get_folios_tag() David Howells
@ 2023-02-26 2:43 ` Linus Torvalds
2023-02-26 3:27 ` Linus Torvalds
8 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2023-02-26 2:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, mm-commits, linux-kernel
On Mon, Feb 20, 2023 at 1:52 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Linus, please merge this cycle's MM updates.
Hmm. Just a note to let you know that I'm bisecting google-chrome
dumping core, and it seems to be due to something in this pull.
I've got a couple of hundred commits to go, so let's see what it
bisects to, but right now all the bad state came from this pull,
afaik.
Maybe it's a false alarm, but it seems to be consistent so far.
Will update when I've bisected closer.
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-26 2:43 ` [GIT PULL] MM updates for 6.3-rc1 Linus Torvalds
@ 2023-02-26 3:27 ` Linus Torvalds
2023-02-26 3:53 ` Linus Torvalds
0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2023-02-26 3:27 UTC (permalink / raw)
To: Andrew Morton, Liam R. Howlett; +Cc: linux-mm, mm-commits, linux-kernel
On Sat, Feb 25, 2023 at 6:43 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. Just a note to let you know that I'm bisecting google-chrome
> dumping core, and it seems to be due to something in this pull.
>
> I've got a couple of hundred commits to go, so let's see what it
> bisects to, but right now all the bad state came from this pull,
> afaik.
>
> Maybe it's a false alarm, but it seems to be consistent so far.
>
> Will update when I've bisected closer.
It's solidly in the "change XYZ to use vma iterator" chunk now.
Commit 0378c0a0e9e4 ("mm/mmap: remove preallocation from
do_mas_align_munmap()") is good, and commit 37598f5a9d8b ("mlock:
convert mlock to vma iterator") is bad.
Will bisect further, but adding Liam to the participants because it's
now narrowed down to his changes.
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-26 3:27 ` Linus Torvalds
@ 2023-02-26 3:53 ` Linus Torvalds
2023-02-26 3:57 ` Andrew Morton
0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2023-02-26 3:53 UTC (permalink / raw)
To: Andrew Morton, Liam R. Howlett; +Cc: linux-mm, mm-commits, linux-kernel
On Sat, Feb 25, 2023 at 7:27 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Will bisect further, but adding Liam to the participants because it's
> now narrowed down to his changes.
Ok, it's commit 2286a6914c77 ("mm: change mprotect_fixup to vma iterator")
It was entirely consistent, and bisected right to that all the way
from my current git tip.
Without that commit, google-chrome works fine.
With that commit, I get "Aww snap" and a
traps: ThreadPoolForeg[4337] trap invalid opcode ip:55d5542363ee
sp:7fa5e04f1f80 error:0 in chrome[55d5537d3000+a14c000]
message in the kernel dumps (and core dump noise in journalctl).
The commit before is fine.
Sadly, it doesn't revert cleanly on my current top-of-tree (or even
_remotely_ cleanly_ because of all the other vma changes), so I can't
test just reverting that on the current state.
Also, it's not like I can debug google-chrome very much. It presumably
does complex vma's and unusual mprotect() stuff to trigger this, when
nothing else seems to care.
Liam?
Linus
---
2286a6914c776ec34cd97e4573b1466d055cb9de is the first bad commit
commit 2286a6914c776ec34cd97e4573b1466d055cb9de
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date: Fri Jan 20 11:26:18 2023 -0500
mm: change mprotect_fixup to vma iterator
Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.
Link: https://lkml.kernel.org/r/20230120162650.984577-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fs/exec.c | 5 ++++-
include/linux/mm.h | 6 +++---
mm/mprotect.c | 47 ++++++++++++++++++++++-------------------------
3 files changed, 29 insertions(+), 29 deletions(-)
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] MM updates for 6.3-rc1
2023-02-26 3:53 ` Linus Torvalds
@ 2023-02-26 3:57 ` Andrew Morton
2023-02-26 4:03 ` Linus Torvalds
0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2023-02-26 3:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Liam R. Howlett, linux-mm, mm-commits, linux-kernel
On Sat, 25 Feb 2023 19:53:04 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, Feb 25, 2023 at 7:27 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Will bisect further, but adding Liam to the participants because it's
> > now narrowed down to his changes.
>
> Ok, it's commit 2286a6914c77 ("mm: change mprotect_fixup to vma iterator")
Liam sent us a fix yesterday, hopefully this?
From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Subject: mm/mprotect: Fix successful vma_merge() of next in do_mprotect_pkey()
Date: Fri, 24 Feb 2023 16:20:55 -0500
If mprotect_fixup() successfully calls vma_merge() and replaces vma and
the next vma, then the tmp variable in the do_mprotect_pkey() is not
updated to point to the new vma end. This results in the loop detecting a
gap between VMAs that does not exist. Fix the faulty value of tmp by
setting it to the end location of the vma iterator at the end of the loop.
Link: https://lkml.kernel.org/r/20230224212055.1786100-1-Liam.Howlett@oracle.com
Fixes: 2286a6914c77 ("mm: change mprotect_fixup to vma iterator")
Link: https://lore.kernel.org/linux-mm/20230223120407.729110a6ecd1416ac59d9cb0@linux-foundation.org/
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Bert Karwatzki <spasswolf@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217061
Tested-by: Bert Karwatzki <spasswolf@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
--- a/mm/mprotect.c~mm-mprotect-fix-successful-vma_merge-of-next-in-do_mprotect_pkey
+++ b/mm/mprotect.c
@@ -832,6 +832,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
if (error)
break;
+ tmp = vma_iter_end(&vmi);
nstart = tmp;
prot = reqprot;
}
_
^ permalink raw reply [flat|nested] 33+ messages in thread