linux-ext4.vger.kernel.org archive mirror
* Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
@ 2018-11-08 17:59 Elana Hashman
  2018-11-08 18:13 ` Reindl Harald
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Elana Hashman @ 2018-11-08 17:59 UTC (permalink / raw)
  To: 'tytso@mit.edu'; +Cc: 'linux-ext4@vger.kernel.org'

Hi Ted,

We've run into a mysterious "phantom" full filesystem issue on our Kubernetes fleet. We initially encountered this issue on kernel 4.1.35, but are still experiencing the problem after upgrading to 4.14.67. Essentially, `df` reports our root filesystems as full and they behave as though they are full, but the "used" space cannot be accounted for. Rebooting the system, remounting the root filesystem read-only and then remounting as read-write, or booting into single-user mode all free up the "used" space. The disk slowly fills up over time, suggesting that there might be some kind of leak; we previously saw this affecting hosts with ~200 days of uptime on the 4.1 kernel, but are now seeing it affect a 4.14 host with only ~70 days of uptime.
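
For context, the way we notice affected hosts is simply that `df` and `du`
disagree by a wide margin; a rough check we use (a sketch, the GiB rounding
is arbitrary):

$ used_df=$(df -B1 --output=used / | tail -n1)
$ used_du=$(sudo du -sxB1 / | cut -f1)
$ echo "$(( (used_df - used_du) / 1024 / 1024 / 1024 )) GiB unaccounted for"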

Here is some data from an example host, running the 4.14.67 kernel. The root disk is ext4.

$ uname -a
Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
$ grep ' / ' /proc/mounts
/dev/disk/by-uuid/<some-uuid> / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0

`df` reports 0 bytes free:

$ df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/<some-uuid>   50G   48G     0 100% /

Deleted but still-open files account for almost no disk capacity:

$ sudo lsof -a +L1 /
COMMAND    PID   USER   FD   TYPE DEVICE SIZE/OFF NLINK    NODE NAME
java      5313 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
java      5313 user   11u   REG    8,3    55185     0 2494654 /tmp/classpath.1668Gp (deleted)
system_ar 5333 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
java      5421 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
java      5421 user   11u   REG    8,3   149313     0 2494486 /tmp/java.fzTwWp (deleted)
java      5421 tsdist   12u   REG    8,3    55185     0 2500513 /tmp/classpath.7AmxHO (deleted)

`du` can only account for 16GB of file usage:

$ sudo du -hxs /
16G     /

But what is most puzzling is the numbers reported by e2freefrag, which don't add up:

$ sudo e2freefrag /dev/disk/by-uuid/<some-uuid>
Device: /dev/disk/by-uuid/<some-uuid>
Blocksize: 4096 bytes
Total blocks: 13107200
Free blocks: 7778076 (59.3%)

Min. free extent: 4 KB
Max. free extent: 8876 KB
Avg. free extent: 224 KB
Num. free extent: 6098

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :          1205          1205    0.02%
    8K...   16K-  :           980          2265    0.03%
   16K...   32K-  :           653          3419    0.04%
   32K...   64K-  :          1337         15374    0.20%
   64K...  128K-  :           631         14151    0.18%
  128K...  256K-  :           224         10205    0.13%
  256K...  512K-  :           261         23818    0.31%
  512K... 1024K-  :           303         56801    0.73%
    1M...    2M-  :           387        135907    1.75%
    2M...    4M-  :           103         64740    0.83%
    4M...    8M-  :            12         15005    0.19%
    8M...   16M-  :             2          4267    0.05%

This looks like a bug to me; the histogram in the manpage example has percentages that add up to 100% but this doesn't even add up to 5%.
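
Summing the Percent column directly confirms it (a quick awk sketch):

$ sudo e2freefrag /dev/disk/by-uuid/<some-uuid> | \
    awk '/%$/ { sum += $NF } END { print sum "%" }'
4.46%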

After a reboot, `df` reflects real utilization:

$ df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/<some-uuid>   50G   16G   31G  34% /

We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.

Thanks for your help,

--
Elana Hashman
ehashman@twosigma.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2018-11-08 17:59 Phantom full ext4 root filesystems on 4.1 through 4.14 kernels Elana Hashman
@ 2018-11-08 18:13 ` Reindl Harald
  2018-11-08 18:20   ` Elana Hashman
  2018-11-08 18:47 ` Darrick J. Wong
  2019-01-24  1:54 ` Phantom full ext4 root filesystems on 4.1 through 4.14 kernels Liu Bo
  2 siblings, 1 reply; 21+ messages in thread
From: Reindl Harald @ 2018-11-08 18:13 UTC (permalink / raw)
  To: Elana Hashman, 'tytso@mit.edu'
  Cc: 'linux-ext4@vger.kernel.org'



On 08.11.18 at 18:59, Elana Hashman wrote:
> After a reboot, `df` reflects real utilization:
> 
> $ df -h /
> Filesystem                                              Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid>   50G   16G   31G  34% /
> 
> We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.

and what does "df -hi" say?
maybe you are simply running out of inodes

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2018-11-08 18:13 ` Reindl Harald
@ 2018-11-08 18:20   ` Elana Hashman
  0 siblings, 0 replies; 21+ messages in thread
From: Elana Hashman @ 2018-11-08 18:20 UTC (permalink / raw)
  To: 'Reindl Harald', 'tytso@mit.edu'
  Cc: 'linux-ext4@vger.kernel.org'

-----Original Message-----
From: Reindl Harald <h.reindl@thelounge.net> 
Sent: Thursday, November 8, 2018 1:14 PM

> and what does "df -hi" say?
> maybe you are simply running out of inodes

We are nowhere near the inode limit:

$ df -hi /
Filesystem                                             Inodes IUsed IFree IUse% Mounted on
/dev/disk/by-uuid/<some-uuid>  3.2M  191K  3.0M    6% /

I can't confirm 100% since we rebooted this host and I don't have a candidate demonstrating the problem right now that I can check, but it's something we've checked previously.

- e

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2018-11-08 17:59 Phantom full ext4 root filesystems on 4.1 through 4.14 kernels Elana Hashman
  2018-11-08 18:13 ` Reindl Harald
@ 2018-11-08 18:47 ` Darrick J. Wong
  2018-12-05 16:26   ` Elana Hashman
  2019-01-24  1:54 ` Phantom full ext4 root filesystems on 4.1 through 4.14 kernels Liu Bo
  2 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2018-11-08 18:47 UTC (permalink / raw)
  To: Elana Hashman
  Cc: 'tytso@mit.edu', 'linux-ext4@vger.kernel.org'

On Thu, Nov 08, 2018 at 05:59:18PM +0000, Elana Hashman wrote:
> Hi Ted,
> 
> We've run into a mysterious "phantom" full filesystem issue on our Kubernetes fleet. We initially encountered this issue on kernel 4.1.35, but are still experiencing the problem after upgrading to 4.14.67. Essentially, `df` reports our root filesystems as full and they behave as though they are full, but the "used" space cannot be accounted for. Rebooting the system, remounting the root filesystem read-only and then remounting as read-write, or booting into single-user mode all free up the "used" space. The disk slowly fills up over time, suggesting that there might be some kind of leak; we previously saw this affecting hosts with ~200 days of uptime on the 4.1 kernel, but are now seeing it affect a 4.14 host with only ~70 days of uptime.
> 
> Here is some data from an example host, running the 4.14.67 kernel. The root disk is ext4.
> 
> $ uname -a
> Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
> $ grep ' / ' /proc/mounts
> /dev/disk/by-uuid/<some-uuid> / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
> 
> `df` reports 0 bytes free:
> 
> $ df -h /
> Filesystem                                              Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid>   50G   48G     0 100% /

This is very odd.  I wonder, how many of those overlayfses are still
mounted on the system at this point?  Over in xfs land we've discovered
that overlayfs subtly changes the lifetime behavior of incore inodes,
maybe that's what's going on here?  (Pure speculation on my part...)
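
(Something like "mount -t overlay | wc -l" ought to show how many are
still mounted, I think.)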

> Deleted but still-open files account for almost no disk capacity:
> 
> $ sudo lsof -a +L1 /
> COMMAND    PID   USER   FD   TYPE DEVICE SIZE/OFF NLINK    NODE NAME
> java      5313 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5313 user   11u   REG    8,3    55185     0 2494654 /tmp/classpath.1668Gp (deleted)
> system_ar 5333 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5421 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5421 user   11u   REG    8,3   149313     0 2494486 /tmp/java.fzTwWp (deleted)
> java      5421 tsdist   12u   REG    8,3    55185     0 2500513 /tmp/classpath.7AmxHO (deleted)
> 
> `du` can only account for 16GB of file usage:
> 
> $ sudo du -hxs /
> 16G     /
> 
> But what is most puzzling is the numbers reported by e2freefrag, which don't add up:
> 
> $ sudo e2freefrag /dev/disk/by-uuid/<some-uuid>
> Device: /dev/disk/by-uuid/<some-uuid>
> Blocksize: 4096 bytes
> Total blocks: 13107200
> Free blocks: 7778076 (59.3%)
> 
> Min. free extent: 4 KB
> Max. free extent: 8876 KB
> Avg. free extent: 224 KB
> Num. free extent: 6098
> 
> HISTOGRAM OF FREE EXTENT SIZES:
> Extent Size Range :  Free extents   Free Blocks  Percent
>     4K...    8K-  :          1205          1205    0.02%
>     8K...   16K-  :           980          2265    0.03%
>    16K...   32K-  :           653          3419    0.04%
>    32K...   64K-  :          1337         15374    0.20%
>    64K...  128K-  :           631         14151    0.18%
>   128K...  256K-  :           224         10205    0.13%
>   256K...  512K-  :           261         23818    0.31%
>   512K... 1024K-  :           303         56801    0.73%
>     1M...    2M-  :           387        135907    1.75%
>     2M...    4M-  :           103         64740    0.83%
>     4M...    8M-  :            12         15005    0.19%
>     8M...   16M-  :             2          4267    0.05%

Clearly a bug in e2freefrag; the percentages are supposed to sum to 100.
Patches soon.

--D

> 
> This looks like a bug to me; the histogram in the manpage example has percentages that add up to 100% but this doesn't even add up to 5%.
> 
> After a reboot, `df` reflects real utilization:
> 
> $ df -h /
> Filesystem                                              Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid>   50G   16G   31G  34% /
> 
> We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.
> 
> Thanks for your help,
> 
> --
> Elana Hashman
> ehashman@twosigma.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2018-11-08 18:47 ` Darrick J. Wong
@ 2018-12-05 16:26   ` Elana Hashman
  2019-01-23 19:59     ` Thomas Walker
  0 siblings, 1 reply; 21+ messages in thread
From: Elana Hashman @ 2018-12-05 16:26 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: 'tytso@mit.edu', 'linux-ext4@vger.kernel.org',
	Thomas Walker

Okay, let's take a look at another affected host. I have not drained it, just cordoned it, so it's still in Kubernetes service and has running, active pods.

$ uname -a
Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux

$ df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/<some-uuid>   50G   46G  1.6G  97% /

$ df -hi /
Filesystem                                             Inodes IUsed IFree IUse% Mounted on
/dev/disk/by-uuid/<some-uuid>   3.2M  203K  3.0M    7% /

$ sudo du -hxs /
21G     /

$ sudo lsof -a +L1 /
lsof: WARNING: can't stat() fuse file system /srv/kubelet/pods/<some-path>
      Output information may be incomplete.
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NLINK    NODE NAME
java    6392 user     11u   REG    8,3    55185     0 1441946 /tmp/classpath.ln0XhI (deleted)
java    6481 user     11u   REG    8,3   149313     0 1441945 /tmp/java.AwFIiw (deleted)
java    6481 user     12u   REG    8,3    55185     0 1441951 /tmp/classpath.qrQhkS (deleted)


There are many overlay mounts currently active:

$ mount | grep overlay | wc -l
40

Also some fuse mounts (as mentioned in the lsof warning on this particular machine):

$ mount | grep fuse | wc -l
21

And just to double-check, all these modules are running in-tree:

$ modinfo ext4 | grep intree
intree:         Y
$ modinfo overlay | grep intree
intree:         Y
$ modinfo fuse | grep intree
intree:         Y

- e

-----Original Message-----
From: Darrick J. Wong <darrick.wong@oracle.com> 
Sent: Thursday, November 8, 2018 1:47 PM
Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

This is very odd.  I wonder, how many of those overlayfses are still mounted on the system at this point?  Over in xfs land we've discovered that overlayfs subtly changes the lifetime behavior of incore inodes, maybe that's what's going on here?  (Pure speculation on my part...)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2018-12-05 16:26   ` Elana Hashman
@ 2019-01-23 19:59     ` Thomas Walker
  2019-06-26 15:17       ` Thomas Walker
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Walker @ 2019-01-23 19:59 UTC (permalink / raw)
  To: Elana Hashman
  Cc: Darrick J. Wong, 'tytso@mit.edu',
	'linux-ext4@vger.kernel.org'

Unfortunately this continues to be a persistent problem for us.  On another example system:

# uname -a
Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux

# df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/<uuid>                                 50G   46G  1.1G  98% /

# df -hi /
Filesystem                                             Inodes IUsed IFree IUse% Mounted on
/dev/disk/by-uuid/<uuid>                                 3.2M  306K  2.9M   10% /

# du -hsx  /
14G     /

And confirmed not to be due to sparse files or deleted but still open files.

The most interesting thing that I've been able to find so far is this:

# mount -o remount,ro /
mount: / is busy
# df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/<uuid>                                 50G   14G   33G  30% /

Something about attempting (and failing) to remount read-only frees up all of the phantom space usage.
Curious whether that sparks ideas in anyone's mind?

I've tried all manner of other things without success.  Unmounting all of the overlays.  Killing off virtually all of userspace (dropping to single user).  Dropping page/inode/dentry caches.  Nothing else (short of a reboot) seems to give us the space back.


On Wed, Dec 05, 2018 at 11:26:19AM -0500, Elana Hashman wrote:
> Okay, let's take a look at another affected host. I have not drained it, just
> cordoned it, so it's still in Kubernetes service and has running, active pods.
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2018-11-08 17:59 Phantom full ext4 root filesystems on 4.1 through 4.14 kernels Elana Hashman
  2018-11-08 18:13 ` Reindl Harald
  2018-11-08 18:47 ` Darrick J. Wong
@ 2019-01-24  1:54 ` Liu Bo
  2019-01-24 14:40   ` Elana Hashman
  2 siblings, 1 reply; 21+ messages in thread
From: Liu Bo @ 2019-01-24  1:54 UTC (permalink / raw)
  To: Elana Hashman; +Cc: tytso, linux-ext4

On Thu, Nov 8, 2018 at 10:11 AM Elana Hashman
<Elana.Hashman@twosigma.com> wrote:
>
> Hi Ted,
>
> We've run into a mysterious "phantom" full filesystem issue on our Kubernetes fleet. We initially encountered this issue on kernel 4.1.35, but are still experiencing the problem after upgrading to 4.14.67. Essentially, `df` reports our root filesystems as full and they behave as though they are full, but the "used" space cannot be accounted for. Rebooting the system, remounting the root filesystem read-only and then remounting as read-write, or booting into single-user mode all free up the "used" space. The disk slowly fills up over time, suggesting that there might be some kind of leak; we previously saw this affecting hosts with ~200 days of uptime on the 4.1 kernel, but are now seeing it affect a 4.14 host with only ~70 days of uptime.
>

I wonder if this ext4 filesystem has bigalloc enabled (this can be checked
with dumpe2fs -h $disk). bigalloc is known to leak space, and that was only
fixed recently.

thanks,
liubo

> Here is some data from an example host, running the 4.14.67 kernel. The root disk is ext4.
>
> $ uname -a
> Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
> $ grep ' / ' /proc/mounts
> /dev/disk/by-uuid/<some-uuid> / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
>
> `df` reports 0 bytes free:
>
> $ df -h /
> Filesystem                                              Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid>   50G   48G     0 100% /
>
> Deleted but still-open files account for almost no disk capacity:
>
> $ sudo lsof -a +L1 /
> COMMAND    PID   USER   FD   TYPE DEVICE SIZE/OFF NLINK    NODE NAME
> java      5313 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5313 user   11u   REG    8,3    55185     0 2494654 /tmp/classpath.1668Gp (deleted)
> system_ar 5333 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5421 user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5421 user   11u   REG    8,3   149313     0 2494486 /tmp/java.fzTwWp (deleted)
> java      5421 tsdist   12u   REG    8,3    55185     0 2500513 /tmp/classpath.7AmxHO (deleted)
>
> `du` can only account for 16GB of file usage:
>
> $ sudo du -hxs /
> 16G     /
>
> But what is most puzzling is the numbers reported by e2freefrag, which don't add up:
>
> $ sudo e2freefrag /dev/disk/by-uuid/<some-uuid>
> Device: /dev/disk/by-uuid/<some-uuid>
> Blocksize: 4096 bytes
> Total blocks: 13107200
> Free blocks: 7778076 (59.3%)
>
> Min. free extent: 4 KB
> Max. free extent: 8876 KB
> Avg. free extent: 224 KB
> Num. free extent: 6098
>
> HISTOGRAM OF FREE EXTENT SIZES:
> Extent Size Range :  Free extents   Free Blocks  Percent
>     4K...    8K-  :          1205          1205    0.02%
>     8K...   16K-  :           980          2265    0.03%
>    16K...   32K-  :           653          3419    0.04%
>    32K...   64K-  :          1337         15374    0.20%
>    64K...  128K-  :           631         14151    0.18%
>   128K...  256K-  :           224         10205    0.13%
>   256K...  512K-  :           261         23818    0.31%
>   512K... 1024K-  :           303         56801    0.73%
>     1M...    2M-  :           387        135907    1.75%
>     2M...    4M-  :           103         64740    0.83%
>     4M...    8M-  :            12         15005    0.19%
>     8M...   16M-  :             2          4267    0.05%
>
> This looks like a bug to me; the histogram in the manpage example has percentages that add up to 100% but this doesn't even add up to 5%.
>
> After a reboot, `df` reflects real utilization:
>
> $ df -h /
> Filesystem                                              Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid>   50G   16G   31G  34% /
>
> We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.
>
> Thanks for your help,
>
> --
> Elana Hashman
> ehashman@twosigma.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-01-24  1:54 ` Phantom full ext4 root filesystems on 4.1 through 4.14 kernels Liu Bo
@ 2019-01-24 14:40   ` Elana Hashman
  0 siblings, 0 replies; 21+ messages in thread
From: Elana Hashman @ 2019-01-24 14:40 UTC (permalink / raw)
  To: Liu Bo; +Cc: tytso, linux-ext4, Thomas Walker

From: Liu Bo <obuil.liubo@gmail.com> 

> I wonder if this ext4 filesystem has bigalloc enabled (this can be checked
> with dumpe2fs -h $disk). bigalloc is known to leak space, and that was only
> fixed recently.

Doesn't look that way:

$ uname -a
Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
$ sudo du -hxs /
20G     /
$ df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/<some-uuid>   50G   46G  1.1G  98% /
$ sudo dumpe2fs -h /dev/disk/by-uuid/<some-uuid>
dumpe2fs 1.42.5 (29-Jul-2012)
Filesystem volume name:   ROOT
Last mounted on:          /
Filesystem UUID:          <some-uuid>
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3276800
Block count:              13107200
Reserved block count:     655360
Free blocks:              9260751
Free inodes:              3078348
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1020
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Tue Jan 16 18:48:13 2018
Last mount time:          Mon Oct  1 20:43:59 2018
Last write time:          Mon Oct  1 20:43:58 2018
Mount count:              7
Maximum mount count:      -1
Last checked:             Tue Jan 16 18:48:13 2018
Check interval:           0 (<none>)
Lifetime writes:          7159 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       2231477
Default directory hash:   half_md4
Directory Hash Seed:      8a6f6d03-37e7-409c-b8c9-d738e0b0a94a
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x02a09116
Journal start:            24688


--
Elana Hashman
ehashman@twosigma.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-01-23 19:59     ` Thomas Walker
@ 2019-06-26 15:17       ` Thomas Walker
  2019-07-11  9:23         ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Walker @ 2019-06-26 15:17 UTC (permalink / raw)
  To: 'linux-ext4@vger.kernel.org'
  Cc: Darrick J. Wong, 'tytso@mit.edu', Geoffrey Thomas

Sorry to revive a rather old thread, but Elana mentioned that there might have been a related fix recently?  Possibly something to do with truncate? 
A quick scan of the last month or so turned up https://www.spinics.net/lists/linux-ext4/msg65772.html but none of these seemed obviously applicable to me.
We do still experience this phantom space usage quite frequently (although the remount workaround below has lowered the priority). 

On Wed, Jan 23, 2019 at 02:59:22PM -0500, Thomas Walker wrote:
> Unfortunately this continues to be a persistent problem for us.  On another example system:
> 
> # uname -a
> Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
> 
> # df -h /
> Filesystem                                              Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<uuid>                                 50G   46G  1.1G  98% /
> 
> # df -hi /
> Filesystem                                             Inodes IUsed IFree IUse% Mounted on
> /dev/disk/by-uuid/<uuid>                                 3.2M  306K  2.9M   10% /
> 
> # du -hsx  /
> 14G     /
> 
> And confirmed not to be due to sparse files or deleted but still open files.
> 
> The most interesting thing that I've been able to find so far is this:
> 
> # mount -o remount,ro /
> mount: / is busy
> # df -h /
> Filesystem                                              Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<uuid>                                 50G   14G   33G  30% /
> 
> Something about attempting (and failing) to remount read-only frees up all of the phantom space usage.
> Curious whether that sparks ideas in anyone's mind?
> 
> I've tried all manner of other things without success.  Unmounting all of the overlays.  Killing off virtually all of userspace (dropping to single user).  Dropping page/inode/dentry caches.  Nothing else (short of a reboot) seems to give us the space back.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-06-26 15:17       ` Thomas Walker
@ 2019-07-11  9:23         ` Jan Kara
  2019-07-11 14:40           ` Geoffrey Thomas
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2019-07-11  9:23 UTC (permalink / raw)
  To: Thomas Walker
  Cc: 'linux-ext4@vger.kernel.org',
	Darrick J. Wong, 'tytso@mit.edu',
	Geoffrey Thomas

On Wed 26-06-19 11:17:54, Thomas Walker wrote:
> Sorry to revive a rather old thread, but Elana mentioned that there might
> have been a related fix recently?  Possibly something to do with
> truncate?  A quick scan of the last month or so turned up
> https://www.spinics.net/lists/linux-ext4/msg65772.html but none of these
> seemed obviously applicable to me.  We do still experience this phantom
> space usage quite frequently (although the remount workaround below has
> lowered the priority). 

I don't recall any fix for this. But seeing that remount "fixes" the issue
for you can you try whether one of the following has a similar effect?

1) Try "sync"
2) Try "fsfreeze -f / && fsfreeze -u /"
3) Try "echo 3 >/proc/sys/vm/drop_caches"

Also what is the contents of
/sys/fs/ext4/<problematic-device>/delayed_allocation_blocks
when the issue happens?

								Honza

> 
> On Wed, Jan 23, 2019 at 02:59:22PM -0500, Thomas Walker wrote:
> > Unfortunately this continues to be a persistent problem for us.  On another example system:
> > 
> > # uname -a
> > Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
> > 
> > # df -h /
> > Filesystem                                              Size  Used Avail Use% Mounted on
> > /dev/disk/by-uuid/<uuid>                                 50G   46G  1.1G  98% /
> > 
> > # df -hi /
> > Filesystem                                             Inodes IUsed IFree IUse% Mounted on
> > /dev/disk/by-uuid/<uuid>                                 3.2M  306K  2.9M   10% /
> > 
> > # du -hsx  /
> > 14G     /
> > 
> > And confirmed not to be due to sparse files or deleted but still open files.
> > 
> > The most interesting thing that I've been able to find so far is this:
> > 
> > # mount -o remount,ro /
> > mount: / is busy
> > # df -h /
> > Filesystem                                              Size  Used Avail Use% Mounted on
> > /dev/disk/by-uuid/<uuid>                                 50G   14G   33G  30% /
> > 
> > Something about attempting (and failing) to remount read-only frees up all of the phantom space usage.
> > Curious whether that sparks ideas in anyone's mind?
> > 
> > I've tried all manner of other things without success.  Unmounting all of the overlays.  Killing off virtually all of userspace (dropping to single user).  Dropping page/inode/dentry caches.  Nothing else (short of a reboot) seems to give us the space back.
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-07-11  9:23         ` Jan Kara
@ 2019-07-11 14:40           ` Geoffrey Thomas
  2019-07-11 15:23             ` Jan Kara
  2019-07-11 17:10             ` Theodore Ts'o
  0 siblings, 2 replies; 21+ messages in thread
From: Geoffrey Thomas @ 2019-07-11 14:40 UTC (permalink / raw)
  To: 'Jan Kara', Thomas Walker
  Cc: 'linux-ext4@vger.kernel.org',
	Darrick J. Wong, 'tytso@mit.edu'

On Thursday, July 11, 2019 5:23 AM, Jan Kara <jack@suse.cz> wrote: 
> On Wed 26-06-19 11:17:54, Thomas Walker wrote:
> > Sorry to revive a rather old thread, but Elana mentioned that there might
> > have been a related fix recently?  Possibly something to do with
> > truncate?  A quick scan of the last month or so turned up
> > https://www.spinics.net/lists/linux-ext4/msg65772.html but none of these
> > seemed obviously applicable to me.  We do still experience this phantom
> > space usage quite frequently (although the remount workaround below has
> > lowered the priority).
> 
> I don't recall any fix for this. But seeing that remount "fixes" the issue
> for you can you try whether one of the following has a similar effect?
> 
> 1) Try "sync"
> 2) Try "fsfreeze -f / && fsfreeze -u /"
> 3) Try "echo 3 >/proc/sys/vm/drop_caches"
> 
> Also what is the contents of
> /sys/fs/ext4/<problematic-device>/delayed_allocation_blocks
> when the issue happens?

We just had one of these today, and no luck from any of those. delayed_allocation_blocks is 1:

# df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299   50G   47G  455M 100% /
# ls -l /dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299
lrwxrwxrwx 1 root root 10 Mar  8 16:03 /dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 -> ../../sda3
# cat /sys/fs/ext4/sda3/delayed_allocation_blocks
1
# sync; sync; sync
# df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299   50G   47G  455M 100% /
# fsfreeze -f /; fsfreeze -u /
# df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299   50G   47G  455M 100% /
# echo 3 > /proc/sys/vm/drop_caches
[blocks for about 10 seconds]
# df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299   50G   47G  454M 100% /
# umount /
[blocks for about 2 seconds]
umount: /: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
# df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299   50G   16G   32G  33% /
# uname -r
4.14.67-ts1

-- 
Geoffrey Thomas
geofft@twosigma.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-07-11 14:40           ` Geoffrey Thomas
@ 2019-07-11 15:23             ` Jan Kara
  2019-07-11 17:10             ` Theodore Ts'o
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Kara @ 2019-07-11 15:23 UTC (permalink / raw)
  To: Geoffrey Thomas
  Cc: 'Jan Kara',
	Thomas Walker, 'linux-ext4@vger.kernel.org',
	Darrick J. Wong, 'tytso@mit.edu'

On Thu 11-07-19 14:40:43, Geoffrey Thomas wrote:
> On Thursday, July 11, 2019 5:23 AM, Jan Kara <jack@suse.cz> wrote: 
> > On Wed 26-06-19 11:17:54, Thomas Walker wrote:
> > > Sorry to revive a rather old thread, but Elana mentioned that there might
> > > have been a related fix recently?  Possibly something to do with
> > > truncate?  A quick scan of the last month or so turned up
> > > https://www.spinics.net/lists/linux-ext4/msg65772.html but none of these
> > > seemed obviously applicable to me.  We do still experience this phantom
> > > space usage quite frequently (although the remount workaround below has
> > > lowered the priority).
> > 
> > I don't recall any fix for this. But seeing that remount "fixes" the issue
> > for you can you try whether one of the following has a similar effect?
> > 
> > 1) Try "sync"
> > 2) Try "fsfreeze -f / && fsfreeze -u /"
> > 3) Try "echo 3 >/proc/sys/vm/drop_caches"
> > 
> > Also what is the contents of
> > /sys/fs/ext4/<problematic-device>/delayed_allocation_blocks
> > when the issue happens?
> 
> We just had one of these today, and no luck from any of those.
> delayed_allocation_blocks is 1:

...

This is very strange because a failed read-only remount (with EBUSY)
doesn't really do more than what "sync; echo 3 >/proc/sys/vm/drop_caches"
does. I suspect there's really some userspace taking up space and cleaning
up on umount. Anyway, once this happens again, can you do:

fsfreeze -f /
e2image -r /dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 - | \
  xz >some_storage.xz
fsfreeze -u /

some_storage.xz can be on a USB stick or similar. This will dump the ext4
metadata to the file. Then please provide some_storage.xz for download
somewhere. Thanks! BTW I'll be on vacation for the next two weeks so it
will take a while to get to this...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-07-11 14:40           ` Geoffrey Thomas
  2019-07-11 15:23             ` Jan Kara
@ 2019-07-11 17:10             ` Theodore Ts'o
  2019-07-12 19:19               ` Thomas Walker
  1 sibling, 1 reply; 21+ messages in thread
From: Theodore Ts'o @ 2019-07-11 17:10 UTC (permalink / raw)
  To: Geoffrey Thomas
  Cc: 'Jan Kara',
	Thomas Walker, 'linux-ext4@vger.kernel.org',
	Darrick J. Wong

Can you try using "df -i" when the file system looks full, and then
reboot, and look at the results of "df -i" afterwards?

Also interesting would be to grab a metadata-only snapshot of the file
system when it is in its mysteriously full state, writing that
snapshot on some other file system *other* than on /dev/sda3:

     e2image -r /dev/sda3 /mnt/sda3.e2i

Then run e2fsck on it:

e2fsck -fy /mnt/sda3.e2i

What I'm curious about is how many "orphaned inodes" are reported, and
how much space they are taking up.  That will look like this:

% gunzip < /usr/src/e2fsprogs/tests/f_orphan/image.gz  > /tmp/foo.img
% e2fsck -fy /tmp/foo.img
e2fsck 1.45.2 (27-May-2019)
Clearing orphaned inode 15 (uid=0, gid=0, mode=040755, size=1024)
Clearing orphaned inode 17 (uid=0, gid=0, mode=0100644, size=0)
Clearing orphaned inode 16 (uid=0, gid=0, mode=040755, size=1024)
Clearing orphaned inode 14 (uid=0, gid=0, mode=0100644, size=69)
Clearing orphaned inode 13 (uid=0, gid=0, mode=040755, size=1024)
...

It's been theorized the bug is in overlayfs, where it's holding inodes
open so the space isn't released.  IIRC somebody had reported a
similar problem with overlayfs on top of xfs.  (BTW, are you using
overlayfs or aufs with your Docker setup?)

		     	       	      - Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-07-11 17:10             ` Theodore Ts'o
@ 2019-07-12 19:19               ` Thomas Walker
  2019-07-12 20:28                 ` Theodore Ts'o
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Walker @ 2019-07-12 19:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Geoffrey Thomas, 'Jan Kara',
	'linux-ext4@vger.kernel.org',
	Darrick J. Wong

On Thu, Jul 11, 2019 at 01:10:46PM -0400, Theodore Ts'o wrote:
> Can you try using "df -i" when the file system looks full, and then
> reboot, and look at the results of "df -i" afterwards?

inode usage doesn't change appreciably between the state with "lost" space, after the remount workaround, or after a reboot.

> Also interesting would be to grab a metadata-only snapshot of the file
> system when it is in its mysteriously full state, writing that
> snapshot on some other file system *other* than on /dev/sda3:
> 
>      e2image -r /dev/sda3 /mnt/sda3.e2i
> 
> Then run e2fsck on it:
> 
> e2fsck -fy /mnt/sda3.e2i
> 
> What I'm curious about is how many "orphaned inodes" are reported, and
> how much space they are taking up.  That will look like this:

<..>
Clearing orphaned inode 2360177 (uid=0, gid=0, mode=0100644, size=1035390)
Clearing orphaned inode 2360181 (uid=0, gid=0, mode=0100644, size=185522)
Clearing orphaned inode 2360176 (uid=0, gid=0, mode=0100644, size=1924512)
Clearing orphaned inode 2360180 (uid=0, gid=0, mode=0100644, size=3621978)
Clearing orphaned inode 1048838 (uid=0, gid=4, mode=0100640, size=39006841856)
release_inode_blocks: Corrupt extent header while calling ext2fs_block_iterate for inode 1048838
<..>

Of particular note, the size of inode 1048838 matches the amount of space that we "lost".
A few months ago I was poking at this with kprobes to try to understand what was happening with the attempt to remount read-only and noticed that it triggered hundreds of:

           <...>-78273 [000] .... 5186888.917840: ext4_remove_blocks: dev 8,3 ino 2889535 extent [0(11384832), 2048]from 0 to 2047 partial_cluster 0
           <...>-78273 [000] .... 5186888.917841: <stack trace>
 => ext4_ext_truncate
 => ext4_truncate
 => ext4_evict_inode
 => evict
 => iput
 => dentry_unlink_inode
 => __dentry_kill
 => dput.part.23
 => dput
 => SyS_rename
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

All with the same inode numbers, whose sizes added up to the same amount of space that we had "lost" and got back.  But the inodes didn't map to any file or open file handle.
Obviously the inode numbers match here and, if I total up all of the ext4_remove_blocks lines, the number of blocks matches both what fsck reports and what we "lost".
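
For reference, ext4_remove_blocks is a standard ext4 tracepoint, so roughly
the same trace can be captured via tracefs without custom kprobes (a sketch):

# cd /sys/kernel/debug/tracing
# echo 1 > options/stacktrace
# echo 1 > events/ext4/ext4_remove_blocks/enable
# cat trace_pipe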

> 
> ...
> 
> It's been theorized the bug is in overlayfs, where it's holding inodes
> open so the space isn't released.  IIRC somewhat had reported a
> similar problem with overlayfs on top of xfs.  (BTW, are you using
> overlayfs or aufs with your Docker setup?)
> 

Yes, we are using overlayfs and had also heard similar reports.  But the ext4 fs here is the root filesystem,
while all of the overlays are on a separate XFS partition.  The only interplay between our root fs and
the docker containers is the occasional ro bind mount from root into the container.

Unfortunately, we've not been able to reproduce this outside of our production plant running real workloads on real data.  I did capture a metadata dump (with dirents scrambled) as Jan asked, but I suspect it will take some work to get it past our security team.  I can certainly do that if we think there is anything valuable in it, though.


Thanks,
Tom.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-07-12 19:19               ` Thomas Walker
@ 2019-07-12 20:28                 ` Theodore Ts'o
  2019-07-12 21:47                   ` Geoffrey Thomas
  0 siblings, 1 reply; 21+ messages in thread
From: Theodore Ts'o @ 2019-07-12 20:28 UTC (permalink / raw)
  To: Thomas Walker
  Cc: Geoffrey Thomas, 'Jan Kara',
	'linux-ext4@vger.kernel.org',
	Darrick J. Wong

On Fri, Jul 12, 2019 at 03:19:03PM -0400, Thomas Walker wrote:
> Clearing orphaned inode 1048838 (uid=0, gid=4, mode=0100640, size=39006841856)

> Of particular note, the size of inode 1048838 matches the amount of space that we "lost".

Hmmm... what's gid 4?  Is that a hint of where the inode might have come from?

Can you try this experiment of e2image... e2fsck, but add a "cp
--sparse" of the e2i file between the e2image and e2fsck steps?
Then, once you can identify the inode that has the huge amount of
orphaned space, try grabbing the first couple of blocks, and then
run "file" on the first part of the file, which might help you
identify where the file came from.  Is it an ISO file?  etc.
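
One way to do the block-grabbing step, as a sketch (debugfs opens the
device read-only by default; <first_block> is whatever the extent tree
printed by "stat" shows as the first physical block):

     debugfs -R 'stat <1048838>' /dev/sda3
     dd if=/dev/sda3 bs=4096 skip=<first_block> count=4 of=/tmp/head.bin
     file /tmp/head.bin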

The goal is to come up with a repeatable way of forcing the failure,
so we can understand the root cause of the "lost space".  The fact
that it's an orphaned inode means that something was hanging onto the
inode.  The question is what part of the kernel was keeping the ref
count elevated.

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-07-12 20:28                 ` Theodore Ts'o
@ 2019-07-12 21:47                   ` Geoffrey Thomas
  2019-07-25 21:22                     ` Geoffrey Thomas
  0 siblings, 1 reply; 21+ messages in thread
From: Geoffrey Thomas @ 2019-07-12 21:47 UTC (permalink / raw)
  To: 'Theodore Ts'o', Thomas Walker
  Cc: 'Jan Kara', 'linux-ext4@vger.kernel.org',
	Darrick J. Wong

On Friday, July 12, 2019 4:28 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Fri, Jul 12, 2019 at 03:19:03PM -0400, Thomas Walker wrote:
> > Clearing orphaned inode 1048838 (uid=0, gid=4, mode=0100640,
> > size=39006841856)
> 
> > Of particular note, the size of inode 1048838 matches the amount of
> > space that we "lost".
> 
> Hmmm... what's gid 4?  Is that a hint of where the inode might have come
> from?

Good call, gid 4 is `adm`. And now that we have an inode number we can see the file's contents, it's from /var/log/account. 

I bet that this is acct(2) holding onto a reference in some weird way (possibly involving logrotate?), which also explains why we couldn't find a userspace process holding onto the inode. We'll investigate a bit....

Thanks,
-- 
Geoffrey Thomas
geofft@twosigma.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-07-12 21:47                   ` Geoffrey Thomas
@ 2019-07-25 21:22                     ` Geoffrey Thomas
  2019-07-29 10:09                       ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Geoffrey Thomas @ 2019-07-25 21:22 UTC (permalink / raw)
  To: 'Theodore Ts'o', Thomas Walker
  Cc: 'Jan Kara', 'linux-ext4@vger.kernel.org',
	'Darrick J. Wong'

On Friday, July 12, 2019 5:47 PM, Geoffrey Thomas <Geoffrey.Thomas@twosigma.com> wrote:
> On Friday, July 12, 2019 4:28 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> > Hmmm... what's gid 4?  Is that a hint of where the inode might have come
> > from?
> 
> Good call, gid 4 is `adm`. And now that we have an inode number we can see
> the file's contents, it's from /var/log/account.
> 
> I bet that this is acct(2) holding onto a reference in some weird way
> (possibly involving logrotate?), which also explains why we couldn't find
> a userspace process holding onto the inode. We'll investigate a bit....

To close this out - yes, this was process accounting. Debian has a nightly cronjob which rotates the pacct logs, runs `invoke-rc.d acct restart` to reopen the file, and compresses the old log. Due to a stray policy-rc.d file from an old provisioning script, however, the restart was being skipped, and so we were unlinking and compressing the pacct file while the kernel still had it open. So it was the classic problem of an open file handle to a large deleted file, except that the open file handle was being held by the kernel.
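
In shell terms the failure mode reduces to roughly this (a sketch; paths as
in Debian's acct package):

    accton /var/log/account/pacct    # kernel now holds this inode open
    mv /var/log/account/pacct /var/log/account/pacct.1
    gzip /var/log/account/pacct.1    # unlinks a file the kernel still writes to
    accton off                       # only this releases the reference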

`accton off` solved our immediate problems and freed the space. I'm not totally sure why a failed umount had that effect, too, but I suppose it turned off process accounting.

It's a little frustrating to me that the file opened by acct(2) doesn't show up to userspace (lsof doesn't seem to find it) - it'd be nice if it could show up in /proc/$some_kernel_thread/fd or somewhere, if possible.

Thanks for the help - the e2image + fsck trick is great!

-- 
Geoffrey Thomas
geofft@twosigma.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
  2019-07-25 21:22                     ` Geoffrey Thomas
@ 2019-07-29 10:09                       ` Jan Kara
  2019-07-29 11:18                         ` ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug? Dmitrij Gusev
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2019-07-29 10:09 UTC (permalink / raw)
  To: Geoffrey Thomas
  Cc: 'Theodore Ts'o', Thomas Walker, 'Jan Kara',
	'linux-ext4@vger.kernel.org', 'Darrick J. Wong'

On Thu 25-07-19 21:22:28, Geoffrey Thomas wrote:
> On Friday, July 12, 2019 5:47 PM, Geoffrey Thomas <Geoffrey.Thomas@twosigma.com> wrote:
> > On Friday, July 12, 2019 4:28 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> > > Hmmm... what's gid 4?  Is that a hint of where the inode might have come
> > > from?
> > 
> > Good call, gid 4 is `adm`. And now that we have an inode number we can see
> > the file's contents, it's from /var/log/account.
> > 
> > I bet that this is acct(2) holding onto a reference in some weird way
> > (possibly involving logrotate?), which also explains why we couldn't find
> > a userspace process holding onto the inode. We'll investigate a bit....
> 
> To close this out - yes, this was process accounting. Debian has a nightly cronjob which rotates the pacct logs, runs `invoke-rc.d acct restart` to reopen the file, and compresses the old log. Due to a stray policy-rc.d file from an old provisioning script, however, the restart was being skipped, and so we were unlinking and compressing the pacct file while the kernel still had it open. So it was the classic problem of an open file handle to a large deleted file, except that the open file handle was being held by the kernel.
> 
> `accton off` solved our immediate problems and freed the space. I'm not totally sure why a failed umount had that effect, too, but I suppose it turned off process accounting.
> 
> It's a little frustrating to me that the file opened by acct(2) doesn't show up to userspace (lsof doesn't seem to find it) - it'd be nice if it could show up in /proc/$some_kernel_thread/fd or somewhere, if possible.
> 
> Thanks for the help - the e2image + fsck trick is great!

Glad to hear you were able to solve the problem in the end :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug?
  2019-07-29 10:09                       ` Jan Kara
@ 2019-07-29 11:18                         ` Dmitrij Gusev
  2019-07-29 12:55                           ` Theodore Y. Ts'o
  0 siblings, 1 reply; 21+ messages in thread
From: Dmitrij Gusev @ 2019-07-29 11:18 UTC (permalink / raw)
  To: 'linux-ext4@vger.kernel.org'; +Cc: 'Theodore Ts'o'

Hello.

An ext4 file system is constantly writing to the block device with no
activity from the applications. Is this a bug?

The write rate is about 64k bytes (almost always exactly 64k bytes) every
1-2 seconds (I discovered it after a RAID sync finished). Please check the
activity log sample below.

I could not find out which process was writing constantly, so I rebooted
into rescue mode and mounted the FS with no application activity
guaranteed; the same write pattern started again, about 64 kB every 1-2
seconds.

The file system was created with the latest e2fsprogs; here are the details:

Linux nexus 4.19.62 #2 SMP Sun Jul 28 17:18:10 CDT 2019 x86_64 AMD 
Athlon(tm) Dual Core Processor 5050e AuthenticAMD GNU/Linux

# tune2fs -l /dev/<XXXXXXvg>/home
tune2fs 1.43.1 (08-Jun-2016)  // the file system was created on another
system with the latest version of the tools
Filesystem volume name:   home
Last mounted on:          /home
Filesystem UUID:          XXXXXXXXXXXXXXXXXX
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              21644032
Block count:              2770432000
Reserved block count:     0
Free blocks:              2095699032
Free inodes:              21205431
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      726
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         256
Inode blocks per group:   16
RAID stride:              128
RAID stripe width:        384
Flex block group size:    16
Filesystem created:       Sat Jul 27 16:41:21 2019
Last mount time:          Mon Jul 29 13:47:33 2019
Last write time:          Mon Jul 29 13:47:33 2019
Mount count:              10
Maximum mount count:      -1
Last checked:             Sat Jul 27 16:41:21 2019
Check interval:           0 (<none>)
Lifetime writes:          2615 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      XXXXXXXXXXXXXXXXXXXXXXX
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x7f0fb170


BlockDev Activity log:

# iostat -kdt 1 /dev/XXXXXXvg/home
Linux 4.19.62 (XXXXX)   07/29/2019      _x86_64_        (2 CPU)

07/29/2019 02:08:03 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.91        15.40        37.77      19309      47364

07/29/2019 02:08:04 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:05 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:06 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              0.00         0.00         0.00          0          0

07/29/2019 02:08:07 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:08 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              0.00         0.00         0.00          0          0

07/29/2019 02:08:09 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:10 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:11 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              0.00         0.00         0.00          0          0

07/29/2019 02:08:12 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:13 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              0.00         0.00         0.00          0          0

07/29/2019 02:08:14 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:15 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:16 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              0.00         0.00         0.00          0          0

07/29/2019 02:08:17 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:18 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              0.00         0.00         0.00          0          0

07/29/2019 02:08:19 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:20 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:21 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              0.00         0.00         0.00          0          0

07/29/2019 02:08:22 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64

07/29/2019 02:08:23 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              0.00         0.00         0.00          0          0

07/29/2019 02:08:24 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dm-5              1.00         0.00        64.00          0         64


Thank you,

---

Dmitrij Gusev

Systems architect



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug?
  2019-07-29 11:18                         ` ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug? Dmitrij Gusev
@ 2019-07-29 12:55                           ` Theodore Y. Ts'o
  2019-07-29 21:12                             ` Dmitrij Gusev
  0 siblings, 1 reply; 21+ messages in thread
From: Theodore Y. Ts'o @ 2019-07-29 12:55 UTC (permalink / raw)
  To: Dmitrij Gusev; +Cc: 'linux-ext4@vger.kernel.org'

On Mon, Jul 29, 2019 at 02:18:22PM +0300, Dmitrij Gusev wrote:
> 
> An ext4 file system is constantly writing to the block device with no
> activity from the applications. Is this a bug?
> 
> The write rate is about 64k bytes (almost always exactly 64k bytes) every
> 1-2 seconds (I discovered it after a RAID sync finished). Please check the
> activity log sample below.

Is this a freshly created file system?  It could be the lazy inode
table initialization.  You can suppress it using "mount -o
noinit_itable", but it will leave portions of the inode table
unzeroed, which can lead to confusion if the system crashes and e2fsck
has to try to recover the file system.

Or you can not enable lazy inode table initialization when the file
system is created, using "mke2fs -t ext4 -E lazy_itable_init=0
/dev/XXX".  (See the manual page for mke2fs.conf for another way to
turn it off by default.)
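
Roughly what that would look like in /etc/mke2fs.conf (a sketch):

    [fs_types]
        ext4 = {
            lazy_itable_init = 0
        }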

Turning off lazy inode table initialization causes mke2fs to take **much**
longer, especially on large RAID arrays.  The idea is to trade off
mkfs time with background activity to initialize the inode table when
the file system is mounted.  The noinit_itable mount option was added
so that a distro installer can temporarily suppress the background
inode table initialization to speed up the install; but then when the
system is booted, it can run in the background later.


If that's not it, try installing the blktrace package and then run
"btrace /dev/<vg>/home", and see what it reports.  For example, here's
the output from running "touch /mnt/test" (comments prefixed by '#'):

# here's the touch process reading the inode...
259,0    2        1    37.115679608  6646  Q  RM 4232 + 8 [touch]
259,0    2        2    37.115682891  6646  C  RM 4232 + 8 [0]
# here's the journal commit, 5 seconds later
259,0    1       11    42.543705759  6570  Q  WS 3932216 + 8 [jbd2/pmem0-8]
259,0    1       12    42.543709184  6570  C  WS 3932216 + 8 [0]
259,0    1       13    42.543713049  6570  Q  WS 3932224 + 8 [jbd2/pmem0-8]
259,0    1       14    42.543714248  6570  C  WS 3932224 + 8 [0]
259,0    1       15    42.543717049  6570  Q  WS 3932232 + 8 [jbd2/pmem0-8]
259,0    1       16    42.543718193  6570  C  WS 3932232 + 8 [0]
259,0    1       17    42.543720895  6570  Q  WS 3932240 + 8 [jbd2/pmem0-8]
259,0    1       18    42.543722028  6570  C  WS 3932240 + 8 [0]
259,0    1       19    42.543724806  6570  Q  WS 3932248 + 8 [jbd2/pmem0-8]
259,0    1       20    42.543725952  6570  C  WS 3932248 + 8 [0]
259,0    1       21    42.543728697  6570  Q  WS 3932256 + 8 [jbd2/pmem0-8]
259,0    1       22    42.543729799  6570  C  WS 3932256 + 8 [0]
259,0    1       23    42.543745380  6570  Q FWFS 3932264 + 8 [jbd2/pmem0-8]
259,0    1       24    42.543746836  6570  C FWFS 3932264 + 8 [0]
# and here's the writeback to the inode table and superblock,
# 30 seconds later
259,0    1       25    72.836967205    91  Q   W 0 + 8 [kworker/u8:3]
259,0    1       26    72.836970861    91  C   W 0 + 8 [0]
259,0    1       27    72.836984218    91  Q  WM 8 + 8 [kworker/u8:3]
259,0    1       28    72.836985929    91  C  WM 8 + 8 [0]
259,0    1       29    72.836992108    91  Q  WM 4232 + 8 [kworker/u8:3]
259,0    1       30    72.836993953    91  C  WM 4232 + 8 [0]
259,0    1       31    72.837001370    91  Q  WM 4360 + 8 [kworker/u8:3]
259,0    1       32    72.837003210    91  C  WM 4360 + 8 [0]
259,0    1       33    72.837010993    91  Q  WM 69896 + 8 [kworker/u8:3]
259,0    1       34    72.837012564    91  C  WM 69896 + 8 [0]

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug?
  2019-07-29 12:55                           ` Theodore Y. Ts'o
@ 2019-07-29 21:12                             ` Dmitrij Gusev
  0 siblings, 0 replies; 21+ messages in thread
From: Dmitrij Gusev @ 2019-07-29 21:12 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: 'linux-ext4@vger.kernel.org'

Hello.

Yes, this is the lazy inode table initialization.

After I remounted the FS with the "noinit_itable" option, the activity
stopped (I then remounted it without the option to let the initialization
continue).

Appreciate your help.

P.S.

Thank you, Theodore and the development team involved for the great ext4 
filesystem.

It is a reliable, high-performance file system. It has always served me
well for many years and continues to do so.

I have never lost any data using it, even though some systems experienced
many crashes or power losses.

Sincerely,

Dmitrij

On 2019-07-29 15:55, Theodore Y. Ts'o wrote:
> On Mon, Jul 29, 2019 at 02:18:22PM +0300, Dmitrij Gusev wrote:
>> An ext4 file system is constantly writing to the block device with no
>> activity from the applications. Is this a bug?
>>
>> The write rate is about 64k bytes (almost always exactly 64k bytes) every
>> 1-2 seconds (I discovered it after a RAID sync finished). Please check
>> the activity log sample below.
> Is this a freshly created file system?  It could be the lazy inode
> table initialization.  You can suppress it using "mount -o
> noinit_itable", but it will leave portions of the inode table
> unzeroed, which can lead to confusion if the system crashes and e2fsck
> has to try to recover the file system.
>
> Or you can not enable lazy inode table initialization when the file
> system is created, using "mke2fs -t ext4 -E lazy_itable_init=0
> /dev/XXX".  (See the manual page for mke2fs.conf for another way to
> turn it off by default.)
>
> Turning off lazy inode table initialization causes mke2fs to take **much**
> longer, especially on large RAID arrays.  The idea is to trade off
> mkfs time with background activity to initialize the inode table when
> the file system is mounted.  The noinit_itable mount option was added
> so that a distro installer can temporarily suppress the background
> inode table initialization to speed up the install; but then when the
> system is booted, it can run in the background later.
>
>
> If that's not it, try installing the blktrace package and then run
> "btrace /dev/<vg>/home", and see what it reports.  For example, here's
> the output from running "touch /mnt/test" (comments prefixed by '#'):
>
> # here's the touch process reading the inode...
> 259,0    2        1    37.115679608  6646  Q  RM 4232 + 8 [touch]
> 259,0    2        2    37.115682891  6646  C  RM 4232 + 8 [0]
> # here's the journal commit, 5 seconds later
> 259,0    1       11    42.543705759  6570  Q  WS 3932216 + 8 [jbd2/pmem0-8]
> 259,0    1       12    42.543709184  6570  C  WS 3932216 + 8 [0]
> 259,0    1       13    42.543713049  6570  Q  WS 3932224 + 8 [jbd2/pmem0-8]
> 259,0    1       14    42.543714248  6570  C  WS 3932224 + 8 [0]
> 259,0    1       15    42.543717049  6570  Q  WS 3932232 + 8 [jbd2/pmem0-8]
> 259,0    1       16    42.543718193  6570  C  WS 3932232 + 8 [0]
> 259,0    1       17    42.543720895  6570  Q  WS 3932240 + 8 [jbd2/pmem0-8]
> 259,0    1       18    42.543722028  6570  C  WS 3932240 + 8 [0]
> 259,0    1       19    42.543724806  6570  Q  WS 3932248 + 8 [jbd2/pmem0-8]
> 259,0    1       20    42.543725952  6570  C  WS 3932248 + 8 [0]
> 259,0    1       21    42.543728697  6570  Q  WS 3932256 + 8 [jbd2/pmem0-8]
> 259,0    1       22    42.543729799  6570  C  WS 3932256 + 8 [0]
> 259,0    1       23    42.543745380  6570  Q FWFS 3932264 + 8 [jbd2/pmem0-8]
> 259,0    1       24    42.543746836  6570  C FWFS 3932264 + 8 [0]
> # and here's the writeback to the inode table and superblock,
> # 30 seconds later
> 259,0    1       25    72.836967205    91  Q   W 0 + 8 [kworker/u8:3]
> 259,0    1       26    72.836970861    91  C   W 0 + 8 [0]
> 259,0    1       27    72.836984218    91  Q  WM 8 + 8 [kworker/u8:3]
> 259,0    1       28    72.836985929    91  C  WM 8 + 8 [0]
> 259,0    1       29    72.836992108    91  Q  WM 4232 + 8 [kworker/u8:3]
> 259,0    1       30    72.836993953    91  C  WM 4232 + 8 [0]
> 259,0    1       31    72.837001370    91  Q  WM 4360 + 8 [kworker/u8:3]
> 259,0    1       32    72.837003210    91  C  WM 4360 + 8 [0]
> 259,0    1       33    72.837010993    91  Q  WM 69896 + 8 [kworker/u8:3]
> 259,0    1       34    72.837012564    91  C  WM 69896 + 8 [0]
>
> Cheers,
>
> 						- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

Thread overview: 21+ messages
2018-11-08 17:59 Phantom full ext4 root filesystems on 4.1 through 4.14 kernels Elana Hashman
2018-11-08 18:13 ` Reindl Harald
2018-11-08 18:20   ` Elana Hashman
2018-11-08 18:47 ` Darrick J. Wong
2018-12-05 16:26   ` Elana Hashman
2019-01-23 19:59     ` Thomas Walker
2019-06-26 15:17       ` Thomas Walker
2019-07-11  9:23         ` Jan Kara
2019-07-11 14:40           ` Geoffrey Thomas
2019-07-11 15:23             ` Jan Kara
2019-07-11 17:10             ` Theodore Ts'o
2019-07-12 19:19               ` Thomas Walker
2019-07-12 20:28                 ` Theodore Ts'o
2019-07-12 21:47                   ` Geoffrey Thomas
2019-07-25 21:22                     ` Geoffrey Thomas
2019-07-29 10:09                       ` Jan Kara
2019-07-29 11:18                         ` ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug? Dmitrij Gusev
2019-07-29 12:55                           ` Theodore Y. Ts'o
2019-07-29 21:12                             ` Dmitrij Gusev
2019-01-24  1:54 ` Phantom full ext4 root filesystems on 4.1 through 4.14 kernels Liu Bo
2019-01-24 14:40   ` Elana Hashman
