* unusual behavior of loop dev with backing file in tmpfs
@ 2021-11-26  7:51 Lukas Czerner
  2022-01-12  4:28 ` Hugh Dickins
  0 siblings, 1 reply; 5+ messages in thread
From: Lukas Czerner @ 2021-11-26  7:51 UTC (permalink / raw)
  To: hughd, linux-mm; +Cc: linux-fsdevel

Hello,

I've noticed an unusual test failure in the e2fsprogs testsuite
(m_assume_storage_prezeroed) where we use mke2fs to create a file system
on a loop device backed by a file on tmpfs. For some reason the number of
blocks allocated to the resulting file (stat -c '%b' /tmp/file) sometimes
differs, but it really should not.

While trying to create a simplified reproducer, I noticed the following
behavior on a mainline kernel (v5.16-rc2-54-g5d9f4cf36721):

# truncate -s16M /tmp/file
# stat -c '%b' /tmp/file
0

# losetup -f /tmp/file
# stat -c '%b' /tmp/file
672

That alone is a little unexpected, since the file is really supposed to
be empty, and when copied out of the tmpfs it really is empty. But the
following is even stranger.

We have the loop device set up from above; let's assume it's /dev/loop0.
The following should be executed in quick succession, as in a script.

# dd if=/dev/zero of=/dev/loop0 bs=4k
# blkdiscard -f /dev/loop0
# stat -c '%b' /tmp/file
0
# sleep 1
# stat -c '%b' /tmp/file
672

Is that expected behavior? From what I've seen, when I use mkfs instead
of this simplified example, the number of blocks allocated as reported by
stat can vary quite a lot given more complex operations. The file itself
does not seem to be corrupted in any way, so it is likely just an
accounting problem.

Any idea what is going on there ?

Thanks!
-Lukas




* Re: unusual behavior of loop dev with backing file in tmpfs
  2021-11-26  7:51 unusual behavior of loop dev with backing file in tmpfs Lukas Czerner
@ 2022-01-12  4:28 ` Hugh Dickins
  2022-01-12 12:29   ` Mikulas Patocka
  2022-01-12 17:19   ` Darrick J. Wong
  0 siblings, 2 replies; 5+ messages in thread
From: Hugh Dickins @ 2022-01-12  4:28 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: Mikulas Patocka, Zdenek Kabelac, hughd, linux-mm, linux-fsdevel

On Fri, 26 Nov 2021, Lukas Czerner wrote:
> 
> I've noticed an unusual test failure in the e2fsprogs testsuite
> (m_assume_storage_prezeroed) where we use mke2fs to create a file system
> on a loop device backed by a file on tmpfs. For some reason the number of
> blocks allocated to the resulting file (stat -c '%b' /tmp/file) sometimes
> differs, but it really should not.
> 
> While trying to create a simplified reproducer, I noticed the following
> behavior on a mainline kernel (v5.16-rc2-54-g5d9f4cf36721):
> 
> # truncate -s16M /tmp/file
> # stat -c '%b' /tmp/file
> 0
> 
> # losetup -f /tmp/file
> # stat -c '%b' /tmp/file
> 672
> 
> That alone is a little unexpected, since the file is really supposed to
> be empty, and when copied out of the tmpfs it really is empty. But the
> following is even stranger.
> 
> We have the loop device set up from above; let's assume it's /dev/loop0.
> The following should be executed in quick succession, as in a script.
> 
> # dd if=/dev/zero of=/dev/loop0 bs=4k
> # blkdiscard -f /dev/loop0
> # stat -c '%b' /tmp/file
> 0
> # sleep 1
> # stat -c '%b' /tmp/file
> 672
> 
> Is that expected behavior? From what I've seen, when I use mkfs instead
> of this simplified example, the number of blocks allocated as reported by
> stat can vary quite a lot given more complex operations. The file itself
> does not seem to be corrupted in any way, so it is likely just an
> accounting problem.
> 
> Any idea what is going on there ?

I have half an answer; but maybe you worked it all out meanwhile anyway.

Yes, it happens like that for me too: 672 (but 216 on an old installation).

Half the answer is that funny code at the head of shmem_file_read_iter():
	/*
	 * Might this read be for a stacking filesystem?  Then when reading
	 * holes of a sparse file, we actually need to allocate those pages,
	 * and even mark them dirty, so it cannot exceed the max_blocks limit.
	 */
	if (!iter_is_iovec(to))
		sgp = SGP_CACHE;
which allocates pages to the tmpfs for reads from /dev/loop0; whereas
normally a read of a sparse tmpfs file would just give zeroes without
allocating.
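
To make the iter_is_iovec() distinction concrete: if I read the loop
driver correctly, it reads its backing file through a bvec-based
iov_iter rather than a user iovec, which is what makes the test above
pick SGP_CACHE. A rough userspace illustration, assuming the /tmp/file
and /dev/loop0 setup from the reproducer above (run as root; just a
sketch, not kernel code):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* Read a file to EOF with plain read(2), discarding the data. */
	static void read_all(const char *path)
	{
		char buf[4096];
		int fd = open(path, O_RDONLY);

		if (fd < 0) { perror(path); exit(1); }
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}

	static void show_blocks(const char *when)
	{
		struct stat st;

		if (stat("/tmp/file", &st) == 0)
			printf("%-26s %lld blocks\n", when, (long long)st.st_blocks);
	}

	int main(void)
	{
		show_blocks("before any read:");
		read_all("/tmp/file");   /* iovec read of the hole: stays sparse */
		show_blocks("after /tmp/file read:");
		read_all("/dev/loop0");  /* served by loop via bvec iter: allocates */
		show_blocks("after /dev/loop0 read:");
		return 0;
	}

If that explanation is right, the middle stat stays small while the
last one presumably jumps to the full 16M worth of blocks.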

[Do we still need that code? Mikulas asked 18 months ago, and I never
responded (sorry) because I failed to arrive at an informed answer.
It comes from a time when unionfs on tmpfs was in active development,
and it solved a real problem then; but by the time it went into tmpfs,
unionfs had already been persuaded to proceed differently, and no
longer needed it. I kept it in for indeterminate other stacking FSs,
but it's probably just cargo cult now, doing more harm than good. I
suspect the best thing to do is, after the 5.17 merge window closes,
to revive Mikulas's patch to delete it and see if anyone complains.]

But what is asynchronously reading /dev/loop0 (instantiating pages
initially, and reinstantiating them after blkdiscard)? I assume it's
some block device tracker, trying to read capacity and/or partition
table; whether from inside or outside the kernel, I expect you'll
guess much better than I can.

Hugh



* Re: unusual behavior of loop dev with backing file in tmpfs
  2022-01-12  4:28 ` Hugh Dickins
@ 2022-01-12 12:29   ` Mikulas Patocka
  2022-01-12 17:19   ` Darrick J. Wong
  1 sibling, 0 replies; 5+ messages in thread
From: Mikulas Patocka @ 2022-01-12 12:29 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Lukas Czerner, Zdenek Kabelac, linux-mm, linux-fsdevel



On Tue, 11 Jan 2022, Hugh Dickins wrote:

> But what is asynchronously reading /dev/loop0 (instantiating pages
> initially, and reinstantiating them after blkdiscard)? I assume it's
> some block device tracker, trying to read capacity and/or partition
> table; whether from inside or outside the kernel, I expect you'll
> guess much better than I can.
> 
> Hugh

That's udev. It reads the filesystem signature on every block device so
that it can create symlinks in /dev/disk/by-uuid.

If you open the block device for write and then close it, udev will
re-scan it.
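
As far as I know it does that with an inotify watch: udev keeps an
IN_CLOSE_WRITE watch on block device nodes and re-probes when the event
fires. A stand-alone illustration of that watch (just a sketch, not
udev's actual code; /dev/loop0 is the device from this thread):

	#include <stdio.h>
	#include <sys/inotify.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int fd = inotify_init1(0);

		if (fd < 0 || inotify_add_watch(fd, "/dev/loop0", IN_CLOSE_WRITE) < 0) {
			perror("inotify");
			return 1;
		}
		for (;;) {
			/* One event per close of a writable fd on /dev/loop0;
			 * this is the point where udev re-reads the device. */
			if (read(fd, buf, sizeof(buf)) <= 0)
				break;
			printf("close-after-write on /dev/loop0, udev would re-scan\n");
		}
		return 0;
	}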

Mikulas




* Re: unusual behavior of loop dev with backing file in tmpfs
  2022-01-12  4:28 ` Hugh Dickins
  2022-01-12 12:29   ` Mikulas Patocka
@ 2022-01-12 17:19   ` Darrick J. Wong
  2022-01-12 17:46     ` Matthew Wilcox
  1 sibling, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2022-01-12 17:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Lukas Czerner, Mikulas Patocka, Zdenek Kabelac, linux-mm, linux-fsdevel

On Tue, Jan 11, 2022 at 08:28:02PM -0800, Hugh Dickins wrote:
> On Fri, 26 Nov 2021, Lukas Czerner wrote:
> > 
> > I've noticed an unusual test failure in the e2fsprogs testsuite
> > (m_assume_storage_prezeroed) where we use mke2fs to create a file system
> > on a loop device backed by a file on tmpfs. For some reason the number of
> > blocks allocated to the resulting file (stat -c '%b' /tmp/file) sometimes
> > differs, but it really should not.
> > 
> > While trying to create a simplified reproducer, I noticed the following
> > behavior on a mainline kernel (v5.16-rc2-54-g5d9f4cf36721):
> > 
> > # truncate -s16M /tmp/file
> > # stat -c '%b' /tmp/file
> > 0
> > 
> > # losetup -f /tmp/file
> > # stat -c '%b' /tmp/file
> > 672
> > 
> > That alone is a little unexpected, since the file is really supposed to
> > be empty, and when copied out of the tmpfs it really is empty. But the
> > following is even stranger.
> > 
> > We have the loop device set up from above; let's assume it's /dev/loop0.
> > The following should be executed in quick succession, as in a script.
> > 
> > # dd if=/dev/zero of=/dev/loop0 bs=4k
> > # blkdiscard -f /dev/loop0
> > # stat -c '%b' /tmp/file
> > 0
> > # sleep 1
> > # stat -c '%b' /tmp/file
> > 672
> > 
> > Is that expected behavior? From what I've seen, when I use mkfs instead
> > of this simplified example, the number of blocks allocated as reported by
> > stat can vary quite a lot given more complex operations. The file itself
> > does not seem to be corrupted in any way, so it is likely just an
> > accounting problem.
> > 
> > Any idea what is going on there ?
> 
> I have half an answer; but maybe you worked it all out meanwhile anyway.
> 
> Yes, it happens like that for me too: 672 (but 216 on an old installation).
> 
> Half the answer is that funny code at the head of shmem_file_read_iter():
> 	/*
> 	 * Might this read be for a stacking filesystem?  Then when reading
> 	 * holes of a sparse file, we actually need to allocate those pages,
> 	 * and even mark them dirty, so it cannot exceed the max_blocks limit.
> 	 */
> 	if (!iter_is_iovec(to))
> 		sgp = SGP_CACHE;
> which allocates pages to the tmpfs for reads from /dev/loop0; whereas
> normally a read of a sparse tmpfs file would just give zeroes without
> allocating.
> 
> [Do we still need that code? Mikulas asked 18 months ago, and I never
> responded (sorry) because I failed to arrive at an informed answer.
> It comes from a time when unionfs on tmpfs was in active development,
> and it solved a real problem then; but by the time it went into tmpfs,
> unionfs had already been persuaded to proceed differently, and no
> longer needed it. I kept it in for indeterminate other stacking FSs,
> but it's probably just cargo cult now, doing more harm than good. I
> suspect the best thing to do is, after the 5.17 merge window closes,
> to revive Mikulas's patch to delete it and see if anyone complains.]

I for one wouldn't mind if tmpfs no longer instantiated cache pages for
a read from a hole -- it's a little strange, since most disk filesystems
(well ok xfs and ext4, haven't checked the others) don't do that.
Anyone who really wants a preallocated page should probably be using
fallocate or something...
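
For reference, the explicit way to ask for that, assuming the 16M
/tmp/file from the reproducer (a quick sketch, not a suggestion for
the testcase itself):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/tmp/file", O_RDWR | O_CREAT, 0644);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Instantiate the full 16M up front; tmpfs supports fallocate,
		 * so st_blocks then reflects a deliberate preallocation rather
		 * than whatever happened to read the loop device. */
		if (fallocate(fd, 0, 0, 16 << 20) < 0) {
			perror("fallocate");
			return 1;
		}
		close(fd);
		return 0;
	}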

--D

> But what is asynchronously reading /dev/loop0 (instantiating pages
> initially, and reinstantiating them after blkdiscard)? I assume it's
> some block device tracker, trying to read capacity and/or partition
> table; whether from inside or outside the kernel, I expect you'll
> guess much better than I can.
> 
> Hugh



* Re: unusual behavior of loop dev with backing file in tmpfs
  2022-01-12 17:19   ` Darrick J. Wong
@ 2022-01-12 17:46     ` Matthew Wilcox
  0 siblings, 0 replies; 5+ messages in thread
From: Matthew Wilcox @ 2022-01-12 17:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Hugh Dickins, Lukas Czerner, Mikulas Patocka, Zdenek Kabelac,
	linux-mm, linux-fsdevel

On Wed, Jan 12, 2022 at 09:19:37AM -0800, Darrick J. Wong wrote:
> I for one wouldn't mind if tmpfs no longer instantiated cache pages for
> a read from a hole -- it's a little strange, since most disk filesystems
> (well ok xfs and ext4, haven't checked the others) don't do that.
> Anyone who really wants a preallocated page should probably be using
> fallocate or something...

We don't allocate disk blocks, but we do allocate pages.

filemap_read()
  filemap_get_pages()
    page_cache_sync_readahead()
      page_cache_sync_ra()
        ondemand_readahead()
          do_page_cache_ra()
            page_cache_ra_unbounded()
              __page_cache_alloc()
              add_to_page_cache_lru()

At this point, we haven't called into the filesystem, so we don't
know that we're allocating pages for a hole.

tmpfs doesn't take this path, though; it has its own
shmem_file_read_iter() instead of calling filemap_read().  I do rather
regret that, because it means tmpfs doesn't take advantage of readahead,
which makes swapping back _in_ rather slow.

It's on my list of things to look at ... eventually.


