From: "Darrick J. Wong" <djwong@kernel.org>
To: Hugh Dickins <hughd@google.com>
Cc: Lukas Czerner <lczerner@redhat.com>,
Mikulas Patocka <mpatocka@redhat.com>,
Zdenek Kabelac <zkabelac@redhat.com>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: Re: unusual behavior of loop dev with backing file in tmpfs
Date: Wed, 12 Jan 2022 09:19:37 -0800 [thread overview]
Message-ID: <20220112171937.GA19154@magnolia> (raw)
In-Reply-To: <5e66a9-4739-80d9-5bb5-cbe2c8fef36@google.com>
On Tue, Jan 11, 2022 at 08:28:02PM -0800, Hugh Dickins wrote:
> On Fri, 26 Nov 2021, Lukas Czerner wrote:
> >
> > I've noticed an unusual test failure in the e2fsprogs testsuite
> > (m_assume_storage_prezeroed) where we use mke2fs to create a file system
> > on a loop device backed by a file on tmpfs. For some reason the number of
> > allocated blocks in the resulting file (stat -c '%b' /tmp/file) sometimes
> > differs, but it really should not.
> >
> > I was trying to create a simplified reproducer and noticed the following
> > behavior on mainline kernel (v5.16-rc2-54-g5d9f4cf36721)
> >
> > # truncate -s16M /tmp/file
> > # stat -c '%b' /tmp/file
> > 0
> >
> > # losetup -f /tmp/file
> > # stat -c '%b' /tmp/file
> > 672
> >
> > That alone is a little unexpected, since the file is really supposed to
> > be empty; and when copied out of the tmpfs, it really is empty. But the
> > following is even weirder.
> >
> > We have a loop setup from above, so let's assume it's /dev/loop0. The
> > following should be executed in quick succession, like in a script.
> >
> > # dd if=/dev/zero of=/dev/loop0 bs=4k
> > # blkdiscard -f /dev/loop0
> > # stat -c '%b' /tmp/file
> > 0
> > # sleep 1
> > # stat -c '%b' /tmp/file
> > 672
> >
> > Is that expected behavior? From what I've seen, when I use mkfs instead
> > of this simplified example, the number of blocks allocated as reported by
> > stat can vary quite a lot given the more complex operations. The file itself
> > does not seem to be corrupted in any way, so it is likely just an
> > accounting problem.
> >
> > Any idea what is going on there ?
>
> I have half an answer; but maybe you worked it all out meanwhile anyway.
>
> Yes, it happens like that for me too: 672 (but 216 on an old installation).
>
> Half the answer is that funny code at the head of shmem_file_read_iter():
> 	/*
> 	 * Might this read be for a stacking filesystem?  Then when reading
> 	 * holes of a sparse file, we actually need to allocate those pages,
> 	 * and even mark them dirty, so it cannot exceed the max_blocks limit.
> 	 */
> 	if (!iter_is_iovec(to))
> 		sgp = SGP_CACHE;
> which allocates pages to the tmpfs for reads from /dev/loop0; whereas
> normally a read of a sparse tmpfs file would just give zeroes without
> allocating.
>
> [Do we still need that code? Mikulas asked 18 months ago, and I never
> responded (sorry) because I failed to arrive at an informed answer.
> It comes from a time while unionfs on tmpfs was actively developing,
> and solved a real problem then; but by the time it went into tmpfs,
> unionfs had already been persuaded to proceed differently, and no
> longer needed it. I kept it in for indeterminate other stacking FSs,
> but it's probably just culted cargo, doing more harm than good. I
> suspect the best thing to do is, after the 5.17 merge window closes,
> revive Mikulas's patch to delete it and see if anyone complains.]
I for one wouldn't mind if tmpfs no longer instantiated cache pages for
a read from a hole -- it's a little strange, since most disk filesystems
(well ok xfs and ext4, haven't checked the others) don't do that.
Anyone who really wants a preallocated page should probably be using
fallocate or something...
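
For what it's worth, the explicit route is a one-liner with the fallocate(1)
utility; a minimal sketch (the path and size are illustrative, not from the
thread):

```shell
# Sketch: preallocate backing blocks explicitly, rather than relying
# on reads instantiating pages. Path and size are illustrative only.
f=/tmp/prealloc-demo
fallocate -l 16M "$f"   # reserve 16 MiB of backing blocks up front
stat -c '%b' "$f"       # nonzero now; typically 32768 512-byte units
rm -f "$f"
```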
--D
> But what is asynchronously reading /dev/loop0 (instantiating pages
> initially, and reinstantiating them after blkdiscard)? I assume it's
> some block device tracker, trying to read capacity and/or partition
> table; whether from inside or outside the kernel, I expect you'll
> guess much better than I can.
>
> Hugh
Thread overview: 5+ messages
2021-11-26 7:51 unusual behavior of loop dev with backing file in tmpfs Lukas Czerner
2022-01-12 4:28 ` Hugh Dickins
2022-01-12 12:29 ` Mikulas Patocka
2022-01-12 17:19 ` Darrick J. Wong [this message]
2022-01-12 17:46 ` Matthew Wilcox