fscache corruption in Linux 5.17?

* fscache corruption in Linux 5.17?
@ 2022-04-12 15:10 Max Kellermann
  2022-04-16 11:38 ` Thorsten Leemhuis
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Max Kellermann @ 2022-04-12 15:10 UTC (permalink / raw)
  To: dhowells; +Cc: linux-cachefs, linux-fsdevel, linux-kernel

Hi David,

two weeks ago, I updated a cluster of web servers to Linux kernel
5.17.1 (5.16.x previously) which includes your rewrite of the fscache
code.

In the last few days, there were numerous complaints about broken
WordPress installations after WordPress was updated.  There were
PHP syntax errors everywhere.

Indeed there were broken PHP files, but the interesting part is: those
corruptions were only on one of the web servers; the others were fine,
the file contents were only broken on one of the servers.

File size and time stamp and everyhing in "stat" is identical, just
the file contents are corrupted; it looks like a mix of old and new
contents.  The corruptions always started at multiples of 4096 bytes.

An example diff:

 --- ok/wp-includes/media.php    2022-04-06 05:51:50.000000000 +0200
 +++ broken/wp-includes/media.php    2022-04-06 05:51:50.000000000 +0200
 @@ -5348,7 +5348,7 @@
                 /**
                  * Filters the threshold for how many of the first content media elements to not lazy-load.
                  *
 -                * For these first content media elements, the `loading` attribute will be omitted. By default, this is the case
 +                * For these first content media elements, the `loading` efault, this is the case
                  * for only the very first content media element.
                  *
                  * @since 5.9.0
 @@ -5377,3 +5377,4 @@

         return $content_media_count;
  }
 +^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

The corruption can be explained by WordPress commit
https://github.com/WordPress/WordPress/commit/07855db0ee8d5cff2 which
makes the file 31 bytes longer (185055 -> 185086).  The "broken" web
server sees the new contents until offset 184320 (= 45 * 4096), but
sees the old contents from there on; followed by 31 null bytes
(because the kernel reads past the end of the cache?).

All web servers mount a storage via NFSv3 with fscache.

My suspicion is that this is caused by a fscache regression in Linux
5.17.  What do you think?

What can I do to debug this further, is there any information you
need?  I don't know much about how fscache works internally and how to
obtain information.

Max

^ permalink raw reply	[flat|nested] 20+ messages in thread