linux-xfs.vger.kernel.org archive mirror
* Slow file operations on file server with 10 TB hardware RAID and 100 TB software RAID
@ 2021-08-20 14:31 Paul Menzel
  2021-08-20 14:39 ` Paul Menzel
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Menzel @ 2021-08-20 14:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: it+linux-xfs

Dear Linux folks,


Short problem statement: Sometimes changing into a directory on a file 
server with a 30 TB hardware RAID and a 100 TB software RAID, both 
formatted with XFS, takes several seconds.


On a Dell PowerEdge T630 with two Xeon E5-2603 v4 CPUs @ 1.70 GHz and 
96 GB RAM, a 30 TB hardware RAID is served by the hardware RAID 
controller, and a 100 TB MDRAID software RAID is connected to a 
Microchip 1100-8e; both are formatted using XFS. Currently, Linux 
5.4.39 runs on it.

```
$ more /proc/version
Linux version 5.4.39.mx64.334 (root@lol.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu May 7 14:27:50 CEST 2020
$ dmesg | grep megar
[   10.322823] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
[   10.331910] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
[   10.345055] megaraid_sas 0000:03:00.0: BAR:0x1  BAR's base_addr(phys):0x0000000092100000  mapped virt_addr:0x0000000059ea5995
[   10.345057] megaraid_sas 0000:03:00.0: FW now in Ready state
[   10.351868] megaraid_sas 0000:03:00.0: 63 bit DMA mask and 32 bit consistent mask
[   10.361655] megaraid_sas 0000:03:00.0: firmware supports msix    : (96)
[   10.369433] megaraid_sas 0000:03:00.0: requested/available msix 13/13
[   10.377113] megaraid_sas 0000:03:00.0: current msix/online cpus    : (13/12)
[   10.385190] megaraid_sas 0000:03:00.0: RDPQ mode    : (disabled)
[   10.392092] megaraid_sas 0000:03:00.0: Current firmware supports maximum commands: 928     LDIO threshold: 0
[   10.403895] megaraid_sas 0000:03:00.0: Configured max firmware commands: 927
[   10.416840] megaraid_sas 0000:03:00.0: Performance mode :Latency
[   10.424029] megaraid_sas 0000:03:00.0: FW supports sync cache    : No
[   10.431417] megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[   10.486158] megaraid_sas 0000:03:00.0: FW provided supportMaxExtLDs: 1    max_lds: 64
[   10.495502] megaraid_sas 0000:03:00.0: controller type    : MR(2048MB)
[   10.502988] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)    : Enabled
[   10.511445] megaraid_sas 0000:03:00.0: Secure JBOD support    : No
[   10.518543] megaraid_sas 0000:03:00.0: NVMe passthru support    : No
[   10.525834] megaraid_sas 0000:03:00.0: FW provided TM TaskAbort/Reset timeout: 0 secs/0 secs
[   10.536251] megaraid_sas 0000:03:00.0: JBOD sequence map support    : No
[   10.543931] megaraid_sas 0000:03:00.0: PCI Lane Margining support    : No
[   10.574406] megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[   10.585995] megaraid_sas 0000:03:00.0: INIT adapter done
[   10.592409] megaraid_sas 0000:03:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5660
[   10.603273] megaraid_sas 0000:03:00.0: pci id        : (0x1000)/(0x005d)/(0x1028)/(0x1f42)
[   10.612815] megaraid_sas 0000:03:00.0: unevenspan support    : yes
[   10.619919] megaraid_sas 0000:03:00.0: firmware crash dump    : no
[   10.627013] megaraid_sas 0000:03:00.0: JBOD sequence map    : disabled
$ dmesg | grep 1100-8e
[   25.853170] smartpqi 0000:84:00.0: added 11:2:0:0 0000000000000000 RAID              Adaptec  1100-8e
[   25.867069] scsi 11:2:0:0: RAID              Adaptec  1100-8e  2.93 PQ: 0 ANSI: 5
$ xfs_info /dev/sdc
meta-data=/dev/sdc               isize=512    agcount=28, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=7323648000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ xfs_info /dev/md0
meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=27348633088, imaxpct=1
         =                       sunit=128    swidth=1792 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ df -i /dev/sdc
Filesystem         Inodes   IUsed      IFree IUse% Mounted on
/dev/sdc       2929459200 4985849 2924473351    1% /home/pmenzel
$ df -i /dev/md0
Filesystem         Inodes   IUsed      IFree IUse% Mounted on
/dev/md0       2187890624 5331603 2182559021    1% /jbod/M8015
```

After not using a directory for a while (over 24 hours), changing into 
it (locally) or doing some git operations there takes over five 
seconds. One example is the Linux kernel source git tree located in my 
home directory. (My shell has some git integration showing the branch 
name in the prompt via `/usr/share/git-contrib/completion/git-prompt.sh`.) 
Once in that directory, everything reacts instantly again. While 
waiting, the Linux pressure stall information (PSI) shows IO resource 
contention.

Before:

     $ grep -R . /proc/pressure/
     /proc/pressure/io:some avg10=0.40 avg60=0.10 avg300=0.10 total=48330841502
     /proc/pressure/io:full avg10=0.40 avg60=0.10 avg300=0.10 total=48067233340
     /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755842910
     /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530206336
     /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318140732

During `git log stable/linux-5.10.y`:

     $ grep -R . /proc/pressure/
     /proc/pressure/io:some avg10=26.20 avg60=9.72 avg300=2.37 total=48337351849
     /proc/pressure/io:full avg10=26.20 avg60=9.72 avg300=2.37 total=48073742033
     /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755843898
     /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530209046
     /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318143440

The current explanation is that overnight several maintenance scripts, 
like backup/mirroring and accounting scripts, are run, which touch all 
files on the devices. Additionally, other users sometimes run cluster 
jobs with millions of files on the software RAID. Such things 
invalidate the inode cache, and “my” inodes are thrown out. When I use 
the directory afterward, it is slow in the beginning. There is still 
free memory during these times according to `top`.

Does that sound reasonable with ten million inodes? Is that easily 
verifiable?


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Slow file operations on file server with 10 TB hardware RAID and 100 TB software RAID
  2021-08-20 14:31 Slow file operations on file server with 10 TB hardware RAID and 100 TB software RAID Paul Menzel
@ 2021-08-20 14:39 ` Paul Menzel
  2021-08-26 10:41   ` Minimum inode cache size? (was: Slow file operations on file server with 30 TB hardware RAID and 100 TB software RAID) Paul Menzel
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Menzel @ 2021-08-20 14:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: it+linux-xfs

Dear Linux folks,


On 20.08.21 16:31, Paul Menzel wrote:

> Short problem statement: Sometimes changing into a directory on a file 
> server with a 30 TB hardware RAID and a 100 TB software RAID, both 
> formatted with XFS, takes several seconds.
> 
> 
> On a Dell PowerEdge T630 with two Xeon E5-2603 v4 CPUs @ 1.70 GHz and 
> 96 GB RAM, a 30 TB hardware RAID is served by the hardware RAID 
> controller, and a 100 TB MDRAID software RAID is connected to a 
> Microchip 1100-8e; both are formatted using XFS. Currently, Linux 
> 5.4.39 runs on it.
> 
> ```
> $ more /proc/version
> Linux version 5.4.39.mx64.334 (root@lol.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu May 7 14:27:50 CEST 2020
> $ dmesg | grep megar
> [   10.322823] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> [   10.331910] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> [   10.345055] megaraid_sas 0000:03:00.0: BAR:0x1  BAR's base_addr(phys):0x0000000092100000  mapped virt_addr:0x0000000059ea5995
> [   10.345057] megaraid_sas 0000:03:00.0: FW now in Ready state
> [   10.351868] megaraid_sas 0000:03:00.0: 63 bit DMA mask and 32 bit consistent mask
> [   10.361655] megaraid_sas 0000:03:00.0: firmware supports msix    : (96)
> [   10.369433] megaraid_sas 0000:03:00.0: requested/available msix 13/13
> [   10.377113] megaraid_sas 0000:03:00.0: current msix/online cpus    : (13/12)
> [   10.385190] megaraid_sas 0000:03:00.0: RDPQ mode    : (disabled)
> [   10.392092] megaraid_sas 0000:03:00.0: Current firmware supports maximum commands: 928     LDIO threshold: 0
> [   10.403895] megaraid_sas 0000:03:00.0: Configured max firmware commands: 927
> [   10.416840] megaraid_sas 0000:03:00.0: Performance mode :Latency
> [   10.424029] megaraid_sas 0000:03:00.0: FW supports sync cache    : No
> [   10.431417] megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> [   10.486158] megaraid_sas 0000:03:00.0: FW provided supportMaxExtLDs: 1    max_lds: 64
> [   10.495502] megaraid_sas 0000:03:00.0: controller type    : MR(2048MB)
> [   10.502988] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)    : Enabled
> [   10.511445] megaraid_sas 0000:03:00.0: Secure JBOD support    : No
> [   10.518543] megaraid_sas 0000:03:00.0: NVMe passthru support    : No
> [   10.525834] megaraid_sas 0000:03:00.0: FW provided TM TaskAbort/Reset timeout: 0 secs/0 secs
> [   10.536251] megaraid_sas 0000:03:00.0: JBOD sequence map support    : No
> [   10.543931] megaraid_sas 0000:03:00.0: PCI Lane Margining support    : No
> [   10.574406] megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
> [   10.585995] megaraid_sas 0000:03:00.0: INIT adapter done
> [   10.592409] megaraid_sas 0000:03:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5660
> [   10.603273] megaraid_sas 0000:03:00.0: pci id        : (0x1000)/(0x005d)/(0x1028)/(0x1f42)
> [   10.612815] megaraid_sas 0000:03:00.0: unevenspan support    : yes
> [   10.619919] megaraid_sas 0000:03:00.0: firmware crash dump    : no
> [   10.627013] megaraid_sas 0000:03:00.0: JBOD sequence map    : disabled
> $ dmesg | grep 1100-8e
> [   25.853170] smartpqi 0000:84:00.0: added 11:2:0:0 0000000000000000 RAID              Adaptec  1100-8e
> [   25.867069] scsi 11:2:0:0: RAID              Adaptec  1100-8e  2.93 PQ: 0 ANSI: 5
> $ xfs_info /dev/sdc
> meta-data=/dev/sdc               isize=512    agcount=28, agsize=268435455 blks
>           =                       sectsz=512   attr=2, projid32bit=1
>           =                       crc=1        finobt=1, sparse=0, rmapbt=0
>           =                       reflink=0
> data     =                       bsize=4096   blocks=7323648000, imaxpct=5
>           =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>           =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> $ xfs_info /dev/md0
> meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
>           =                       sectsz=4096  attr=2, projid32bit=1
>           =                       crc=1        finobt=1, sparse=0, rmapbt=0
>           =                       reflink=0
> data     =                       bsize=4096   blocks=27348633088, imaxpct=1
>           =                       sunit=128    swidth=1792 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>           =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> $ df -i /dev/sdc
> Filesystem         Inodes   IUsed      IFree IUse% Mounted on
> /dev/sdc       2929459200 4985849 2924473351    1% /home/pmenzel
> $ df -i /dev/md0
> Filesystem         Inodes   IUsed      IFree IUse% Mounted on
> /dev/md0       2187890624 5331603 2182559021    1% /jbod/M8015
> ```
> 
> After not using a directory for a while (over 24 hours), changing into 
> it (locally) or doing some git operations there takes over five 
> seconds. One example is the Linux kernel source git tree located in my 
> home directory. (My shell has some git integration showing the branch 
> name in the prompt via `/usr/share/git-contrib/completion/git-prompt.sh`.) 
> Once in that directory, everything reacts instantly again. While 
> waiting, the Linux pressure stall information (PSI) shows IO resource 
> contention.
> 
> Before:
> 
>      $ grep -R . /proc/pressure/
>      /proc/pressure/io:some avg10=0.40 avg60=0.10 avg300=0.10 total=48330841502
>      /proc/pressure/io:full avg10=0.40 avg60=0.10 avg300=0.10 total=48067233340
>      /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755842910
>      /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530206336
>      /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318140732
> 
> During `git log stable/linux-5.10.y`:
> 
>      $ grep -R . /proc/pressure/
>      /proc/pressure/io:some avg10=26.20 avg60=9.72 avg300=2.37 total=48337351849
>      /proc/pressure/io:full avg10=26.20 avg60=9.72 avg300=2.37 total=48073742033
>      /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755843898
>      /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530209046
>      /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318143440
> 
> The current explanation is that overnight several maintenance scripts, 
> like backup/mirroring and accounting scripts, are run, which touch all 
> files on the devices. Additionally, other users sometimes run cluster 
> jobs with millions of files on the software RAID. Such things 
> invalidate the inode cache, and “my” inodes are thrown out. When I use 
> the directory afterward, it is slow in the beginning. There is still 
> free memory during these times according to `top`.

     $ free -h
                   total        used        free      shared  buff/cache   available
     Mem:            94G        8.3G        5.3G        2.3M         80G         83G
     Swap:            0B          0B          0B

> Does that sound reasonable with ten million inodes? Is that easily 
> verifiable?

If an inode consumes 512 bytes, ten million inodes would be around 
5 GB, which should easily fit into memory, so the cache should not 
need to be invalidated?
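
As a quick sanity check using GNU coreutils’ `numfmt`, 10,000,000 × 512 
bytes is roughly 4.8 GiB:

     $ numfmt --to=iec $(( 10 * 1000 * 1000 * 512 ))
     4.8G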


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Minimum inode cache size? (was: Slow file operations on file server with 30 TB hardware RAID and 100 TB software RAID)
  2021-08-20 14:39 ` Paul Menzel
@ 2021-08-26 10:41   ` Paul Menzel
  2021-08-26 16:49     ` Donald Buczek
  2021-08-26 21:53     ` Dave Chinner
  0 siblings, 2 replies; 5+ messages in thread
From: Paul Menzel @ 2021-08-26 10:41 UTC (permalink / raw)
  To: LKML; +Cc: it+linux-xfs, linux-fsdevel, linux-xfs, linux-mm

Dear Linux folks,


On 20.08.21 16:39, Paul Menzel wrote:

> On 20.08.21 16:31, Paul Menzel wrote:
> 
>> Short problem statement: Sometimes changing into a directory on a file 
>> server with a 30 TB hardware RAID and a 100 TB software RAID, both 
>> formatted with XFS, takes several seconds.
>>
>>
>> On a Dell PowerEdge T630 with two Xeon E5-2603 v4 CPUs @ 1.70 GHz and 
>> 96 GB RAM, a 30 TB hardware RAID is served by the hardware RAID 
>> controller, and a 100 TB MDRAID software RAID is connected to a 
>> Microchip 1100-8e; both are formatted using XFS. Currently, Linux 
>> 5.4.39 runs on it.
>>
>> ```
>> $ more /proc/version
>> Linux version 5.4.39.mx64.334 (root@lol.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu May 7 14:27:50 CEST 2020
>> $ dmesg | grep megar
>> [   10.322823] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>> [   10.331910] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>> [   10.345055] megaraid_sas 0000:03:00.0: BAR:0x1  BAR's base_addr(phys):0x0000000092100000  mapped virt_addr:0x0000000059ea5995
>> [   10.345057] megaraid_sas 0000:03:00.0: FW now in Ready state
>> [   10.351868] megaraid_sas 0000:03:00.0: 63 bit DMA mask and 32 bit consistent mask
>> [   10.361655] megaraid_sas 0000:03:00.0: firmware supports msix    : (96)
>> [   10.369433] megaraid_sas 0000:03:00.0: requested/available msix 13/13
>> [   10.377113] megaraid_sas 0000:03:00.0: current msix/online cpus    : (13/12)
>> [   10.385190] megaraid_sas 0000:03:00.0: RDPQ mode    : (disabled)
>> [   10.392092] megaraid_sas 0000:03:00.0: Current firmware supports maximum commands: 928     LDIO threshold: 0
>> [   10.403895] megaraid_sas 0000:03:00.0: Configured max firmware commands: 927
>> [   10.416840] megaraid_sas 0000:03:00.0: Performance mode :Latency
>> [   10.424029] megaraid_sas 0000:03:00.0: FW supports sync cache    : No
>> [   10.431417] megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
>> [   10.486158] megaraid_sas 0000:03:00.0: FW provided supportMaxExtLDs: 1    max_lds: 64
>> [   10.495502] megaraid_sas 0000:03:00.0: controller type    : MR(2048MB)
>> [   10.502988] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)    : Enabled
>> [   10.511445] megaraid_sas 0000:03:00.0: Secure JBOD support    : No
>> [   10.518543] megaraid_sas 0000:03:00.0: NVMe passthru support    : No
>> [   10.525834] megaraid_sas 0000:03:00.0: FW provided TM TaskAbort/Reset timeout: 0 secs/0 secs
>> [   10.536251] megaraid_sas 0000:03:00.0: JBOD sequence map support    : No
>> [   10.543931] megaraid_sas 0000:03:00.0: PCI Lane Margining support    : No
>> [   10.574406] megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
>> [   10.585995] megaraid_sas 0000:03:00.0: INIT adapter done
>> [   10.592409] megaraid_sas 0000:03:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5660
>> [   10.603273] megaraid_sas 0000:03:00.0: pci id        : (0x1000)/(0x005d)/(0x1028)/(0x1f42)
>> [   10.612815] megaraid_sas 0000:03:00.0: unevenspan support    : yes
>> [   10.619919] megaraid_sas 0000:03:00.0: firmware crash dump    : no
>> [   10.627013] megaraid_sas 0000:03:00.0: JBOD sequence map    : disabled
>> $ dmesg | grep 1100-8e
>> [   25.853170] smartpqi 0000:84:00.0: added 11:2:0:0 0000000000000000 RAID              Adaptec  1100-8e
>> [   25.867069] scsi 11:2:0:0: RAID              Adaptec  1100-8e  2.93 PQ: 0 ANSI: 5
>> $ xfs_info /dev/sdc
>> meta-data=/dev/sdc               isize=512    agcount=28, agsize=268435455 blks
>>           =                       sectsz=512   attr=2, projid32bit=1
>>           =                       crc=1        finobt=1, sparse=0, rmapbt=0
>>           =                       reflink=0
>> data     =                       bsize=4096   blocks=7323648000, imaxpct=5
>>           =                       sunit=0      swidth=0 blks
>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>> log      =internal log           bsize=4096   blocks=521728, version=2
>>           =                       sectsz=512   sunit=0 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>> $ xfs_info /dev/md0
>> meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
>>           =                       sectsz=4096  attr=2, projid32bit=1
>>           =                       crc=1        finobt=1, sparse=0, rmapbt=0
>>           =                       reflink=0
>> data     =                       bsize=4096   blocks=27348633088, imaxpct=1
>>           =                       sunit=128    swidth=1792 blks
>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>> log      =internal log           bsize=4096   blocks=521728, version=2
>>           =                       sectsz=4096  sunit=1 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>> $ df -i /dev/sdc
>> Filesystem         Inodes   IUsed      IFree IUse% Mounted on
>> /dev/sdc       2929459200 4985849 2924473351    1% /home/pmenzel
>> $ df -i /dev/md0
>> Filesystem         Inodes   IUsed      IFree IUse% Mounted on
>> /dev/md0       2187890624 5331603 2182559021    1% /jbod/M8015
>> ```
>>
>> After not using a directory for a while (over 24 hours), changing into 
>> it (locally) or doing some git operations there takes over five 
>> seconds. One example is the Linux kernel source git tree located in my 
>> home directory. (My shell has some git integration showing the branch 
>> name in the prompt via `/usr/share/git-contrib/completion/git-prompt.sh`.) 
>> Once in that directory, everything reacts instantly again. While 
>> waiting, the Linux pressure stall information (PSI) shows IO resource 
>> contention.
>>
>> Before:
>>
>>      $ grep -R . /proc/pressure/
>>      /proc/pressure/io:some avg10=0.40 avg60=0.10 avg300=0.10 total=48330841502
>>      /proc/pressure/io:full avg10=0.40 avg60=0.10 avg300=0.10 total=48067233340
>>      /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755842910
>>      /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530206336
>>      /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318140732
>>
>> During `git log stable/linux-5.10.y`:
>>
>>      $ grep -R . /proc/pressure/
>>      /proc/pressure/io:some avg10=26.20 avg60=9.72 avg300=2.37 total=48337351849
>>      /proc/pressure/io:full avg10=26.20 avg60=9.72 avg300=2.37 total=48073742033
>>      /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755843898
>>      /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530209046
>>      /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318143440
>>
>> The current explanation is that overnight several maintenance scripts, 
>> like backup/mirroring and accounting scripts, are run, which touch all 
>> files on the devices. Additionally, other users sometimes run cluster 
>> jobs with millions of files on the software RAID. Such things 
>> invalidate the inode cache, and “my” inodes are thrown out. When I use 
>> the directory afterward, it is slow in the beginning. There is still 
>> free memory during these times according to `top`.
> 
>      $ free -h
>                    total        used        free      shared  buff/cache available
>      Mem:            94G        8.3G        5.3G        2.3M         80G       83G
>      Swap:            0B          0B          0B
> 
>> Does that sound reasonable with ten million inodes? Is that easily 
>> verifiable?
> 
> If an inode consumes 512 bytes, ten million inodes would be around 
> 5 GB, which should easily fit into memory, so the cache should not 
> need to be invalidated?

Something is wrong with that calculation, and the cache size is much bigger.

Looking into `/proc/slabinfo` and XFS’ runtime/internal statistics [1], 
it turns out that the inode cache is likely the problem.

XFS’ internal stats show that only one third of the inode requests are 
answered from cache.

     $ grep ^ig /sys/fs/xfs/stats/stats
     ig 1791207386 647353522 20111 1143854223 394 1142080045 10683174
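
Assuming, per the runtime stats page [1], that the second field of the 
`ig` line is the total number of inode cache lookups and the third field 
is the number of inodes found in the cache, the hit rate can be computed 
directly; something like:

     $ awk '/^ig / { printf "inode cache hit rate: %.1f%%\n", 100 * $3 / $2 }' \
           /sys/fs/xfs/stats/stats
     inode cache hit rate: 36.1%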

During the problematic time, the SLAB size is around 4 GB and, according 
to slabinfo, the inode cache only has around 200,000 entries (sometimes 
even as low as 50,000).

     $ sudo grep inode /proc/slabinfo
     nfs_inode_cache       16     24   1064    3    1 : tunables   24   12    8 : slabdata      8      8      0
     rpc_inode_cache       94    138    640    6    1 : tunables   54   27    8 : slabdata     23     23      0
     mqueue_inode_cache      1      4    896    4    1 : tunables   54   27    8 : slabdata      1      1      0
     xfs_inode         1693683 1722284    960    4    1 : tunables   54   27    8 : slabdata 430571 430571      0
     ext2_inode_cache       0      0    768    5    1 : tunables   54   27    8 : slabdata      0      0      0
     reiser_inode_cache      0      0    760    5    1 : tunables   54   27    8 : slabdata      0      0      0
     hugetlbfs_inode_cache      2     12    608    6    1 : tunables   54   27    8 : slabdata      2      2      0
     sock_inode_cache     346    670    768    5    1 : tunables   54   27    8 : slabdata    134    134      0
     proc_inode_cache     121    288    656    6    1 : tunables   54   27    8 : slabdata     48     48      0
     shmem_inode_cache   2249   2827    696   11    2 : tunables   54   27    8 : slabdata    257    257      0
     inode_cache       209098 209482    584    7    1 : tunables   54   27    8 : slabdata  29926  29926      0

(What is the difference between `xfs_inode` and `inode_cache`?)

Then going through all the files with `find -ls` grows the inode cache 
to four to five million entries and the SLAB size to around 8 GB. 
Overnight, it shrinks back to the numbers above, and the page cache 
grows back.

In the discussions [2], adjusting `vfs_cache_pressure` is recommended, 
but – besides setting it to 0 – it only seems to delay the shrinking of 
the cache. (As it is an integer, 1 is the lowest positive value, which 
would only delay the shrinking by a factor of 100.)
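
For illustration, a minimal sketch of such a tuning attempt (the value 1 
is only an example):

     # check the current value (the default is 100)
     $ sysctl vm.vfs_cache_pressure
     # lower the reclaim pressure on dentry/inode caches
     $ sudo sysctl -w vm.vfs_cache_pressure=1
     # make it persistent across reboots (file name is only an example)
     $ echo 'vm.vfs_cache_pressure = 1' | sudo tee /etc/sysctl.d/90-vfs-cache.conf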

Is there a way to specify the minimum number of entries in the inode 
cache, or a minimum SLAB size below which the caches should not be 
shrunk?


Kind regards,

Paul


[1]: https://xfs.org/index.php/Runtime_Stats#ig
[2]: https://linux-xfs.oss.sgi.narkive.com/qa0AYeBS/improving-xfs-file-system-inode-performance
      "Improving XFS file system inode performance" from 2010

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Minimum inode cache size? (was: Slow file operations on file server with 30 TB hardware RAID and 100 TB software RAID)
  2021-08-26 10:41   ` Minimum inode cache size? (was: Slow file operations on file server with 30 TB hardware RAID and 100 TB software RAID) Paul Menzel
@ 2021-08-26 16:49     ` Donald Buczek
  2021-08-26 21:53     ` Dave Chinner
  1 sibling, 0 replies; 5+ messages in thread
From: Donald Buczek @ 2021-08-26 16:49 UTC (permalink / raw)
  To: Paul Menzel, LKML; +Cc: it+linux-xfs, linux-fsdevel, linux-xfs, linux-mm

On 26.08.21 12:41, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> On 20.08.21 16:39, Paul Menzel wrote:
> 
>> On 20.08.21 16:31, Paul Menzel wrote:
>>
>>> Short problem statement: Sometimes changing into a directory on a file server with a 30 TB hardware RAID and a 100 TB software RAID, both formatted with XFS, takes several seconds.
>>>
>>>
>>> On a Dell PowerEdge T630 with two Xeon E5-2603 v4 CPUs @ 1.70 GHz and 96 GB RAM, a 30 TB hardware RAID is served by the hardware RAID controller, and a 100 TB MDRAID software RAID is connected to a Microchip 1100-8e; both are formatted using XFS. Currently, Linux 5.4.39 runs on it.
>>>
>>> ```
>>> $ more /proc/version
>>> Linux version 5.4.39.mx64.334 (root@lol.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu May 7 14:27:50 CEST 2020
>>> $ dmesg | grep megar
>>> [   10.322823] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>>> [   10.331910] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>>> [   10.345055] megaraid_sas 0000:03:00.0: BAR:0x1  BAR's base_addr(phys):0x0000000092100000  mapped virt_addr:0x0000000059ea5995
>>> [   10.345057] megaraid_sas 0000:03:00.0: FW now in Ready state
>>> [   10.351868] megaraid_sas 0000:03:00.0: 63 bit DMA mask and 32 bit consistent mask
>>> [   10.361655] megaraid_sas 0000:03:00.0: firmware supports msix    : (96)
>>> [   10.369433] megaraid_sas 0000:03:00.0: requested/available msix 13/13
>>> [   10.377113] megaraid_sas 0000:03:00.0: current msix/online cpus    : (13/12)
>>> [   10.385190] megaraid_sas 0000:03:00.0: RDPQ mode    : (disabled)
>>> [   10.392092] megaraid_sas 0000:03:00.0: Current firmware supports maximum commands: 928     LDIO threshold: 0
>>> [   10.403895] megaraid_sas 0000:03:00.0: Configured max firmware commands: 927
>>> [   10.416840] megaraid_sas 0000:03:00.0: Performance mode :Latency
>>> [   10.424029] megaraid_sas 0000:03:00.0: FW supports sync cache    : No
>>> [   10.431417] megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
>>> [   10.486158] megaraid_sas 0000:03:00.0: FW provided supportMaxExtLDs: 1    max_lds: 64
>>> [   10.495502] megaraid_sas 0000:03:00.0: controller type    : MR(2048MB)
>>> [   10.502988] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)    : Enabled
>>> [   10.511445] megaraid_sas 0000:03:00.0: Secure JBOD support    : No
>>> [   10.518543] megaraid_sas 0000:03:00.0: NVMe passthru support    : No
>>> [   10.525834] megaraid_sas 0000:03:00.0: FW provided TM TaskAbort/Reset timeout: 0 secs/0 secs
>>> [   10.536251] megaraid_sas 0000:03:00.0: JBOD sequence map support    : No
>>> [   10.543931] megaraid_sas 0000:03:00.0: PCI Lane Margining support    : No
>>> [   10.574406] megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
>>> [   10.585995] megaraid_sas 0000:03:00.0: INIT adapter done
>>> [   10.592409] megaraid_sas 0000:03:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5660
>>> [   10.603273] megaraid_sas 0000:03:00.0: pci id        : (0x1000)/(0x005d)/(0x1028)/(0x1f42)
>>> [   10.612815] megaraid_sas 0000:03:00.0: unevenspan support    : yes
>>> [   10.619919] megaraid_sas 0000:03:00.0: firmware crash dump    : no
>>> [   10.627013] megaraid_sas 0000:03:00.0: JBOD sequence map    : disabled
>>> $ dmesg | grep 1100-8e
>>> [   25.853170] smartpqi 0000:84:00.0: added 11:2:0:0 0000000000000000 RAID              Adaptec  1100-8e
>>> [   25.867069] scsi 11:2:0:0: RAID              Adaptec  1100-8e  2.93 PQ: 0 ANSI: 5
>>> $ xfs_info /dev/sdc
>>> meta-data=/dev/sdc               isize=512    agcount=28, agsize=268435455 blks
>>>           =                       sectsz=512   attr=2, projid32bit=1
>>>           =                       crc=1        finobt=1, sparse=0, rmapbt=0
>>>           =                       reflink=0
>>> data     =                       bsize=4096   blocks=7323648000, imaxpct=5
>>>           =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>           =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>> $ xfs_info /dev/md0
>>> meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
>>>           =                       sectsz=4096  attr=2, projid32bit=1
>>>           =                       crc=1        finobt=1, sparse=0, rmapbt=0
>>>           =                       reflink=0
>>> data     =                       bsize=4096   blocks=27348633088, imaxpct=1
>>>           =                       sunit=128    swidth=1792 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>           =                       sectsz=4096  sunit=1 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>> $ df -i /dev/sdc
>>> Filesystem         Inodes   IUsed      IFree IUse% Mounted on
>>> /dev/sdc       2929459200 4985849 2924473351    1% /home/pmenzel
>>> $ df -i /dev/md0
>>> Filesystem         Inodes   IUsed      IFree IUse% Mounted on
>>> /dev/md0       2187890624 5331603 2182559021    1% /jbod/M8015
>>> ```
>>>
>>> After not using a directory for a while (over 24 hours), changing into it (locally) or doing some git operations there takes over five seconds. One example is the Linux kernel source git tree located in my home directory. (My shell has some git integration showing the branch name in the prompt via `/usr/share/git-contrib/completion/git-prompt.sh`.) Once in that directory, everything reacts instantly again. While waiting, the Linux pressure stall information (PSI) shows IO resource contention.
>>>
>>> Before:
>>>
>>>      $ grep -R . /proc/pressure/
>>>      /proc/pressure/io:some avg10=0.40 avg60=0.10 avg300=0.10 total=48330841502
>>>      /proc/pressure/io:full avg10=0.40 avg60=0.10 avg300=0.10 total=48067233340
>>>      /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755842910
>>>      /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530206336
>>>      /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318140732
>>>
>>> During `git log stable/linux-5.10.y`:
>>>
>>>      $ grep -R . /proc/pressure/
>>>      /proc/pressure/io:some avg10=26.20 avg60=9.72 avg300=2.37 total=48337351849
>>>      /proc/pressure/io:full avg10=26.20 avg60=9.72 avg300=2.37 total=48073742033
>>>      /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755843898
>>>      /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530209046
>>>      /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318143440
>>>
>>> The current explanation is that overnight several maintenance scripts, like backup/mirroring and accounting scripts, are run, which touch all files on the devices. Additionally, other users sometimes run cluster jobs with millions of files on the software RAID. Such things invalidate the inode cache, and “my” inodes are thrown out. When I use the directory afterward, it is slow in the beginning. There is still free memory during these times according to `top`.
>>
>>      $ free -h
>>                    total        used        free      shared  buff/cache available
>>      Mem:            94G        8.3G        5.3G        2.3M         80G       83G
>>      Swap:            0B          0B          0B
>>
>>> Does that sound reasonable with ten million inodes? Is that easily verifiable?
>>
>> If an inode consumes 512 bytes, ten million inodes would be around 5 GB, which should easily fit into memory, so the cache should not need to be invalidated?
> 
> Something is wrong with that calculation, and the cache size is much bigger.
> 
> Looking into `/proc/slabinfo` and XFS’ runtime/internal statistics [1], it turns out that the inode cache is likely the problem.
> 
> XFS’ internal stats show that only one third of the inode requests are answered from cache.
> 
>      $ grep ^ig /sys/fs/xfs/stats/stats
>      ig 1791207386 647353522 20111 1143854223 394 1142080045 10683174
> 
> During the problematic time, the SLAB size is around 4 GB and, according to slabinfo, the inode cache only has around 200,000 entries (sometimes even as low as 50,000).
> 
>      $ sudo grep inode /proc/slabinfo
>      nfs_inode_cache       16     24   1064    3    1 : tunables   24 12    8 : slabdata      8      8      0
>      rpc_inode_cache       94    138    640    6    1 : tunables   54 27    8 : slabdata     23     23      0
>      mqueue_inode_cache      1      4    896    4    1 : tunables   54  27    8 : slabdata      1      1      0
>      xfs_inode         1693683 1722284    960    4    1 : tunables   54   27    8 : slabdata 430571 430571      0
>      ext2_inode_cache       0      0    768    5    1 : tunables   54 27    8 : slabdata      0      0      0
>      reiser_inode_cache      0      0    760    5    1 : tunables   54  27    8 : slabdata      0      0      0
>      hugetlbfs_inode_cache      2     12    608    6    1 : tunables 54   27    8 : slabdata      2      2      0
>      sock_inode_cache     346    670    768    5    1 : tunables   54 27    8 : slabdata    134    134      0
>      proc_inode_cache     121    288    656    6    1 : tunables   54 27    8 : slabdata     48     48      0
>      shmem_inode_cache   2249   2827    696   11    2 : tunables   54 27    8 : slabdata    257    257      0
>      inode_cache       209098 209482    584    7    1 : tunables   54 27    8 : slabdata  29926  29926      0
> 
> (What is the difference between `xfs_inode` and `inode_cache`?)
> 
> Then going through all the files with `find -ls` grows the inode cache to four to five million entries and the SLAB size to around 8 GB. Overnight, it shrinks back to the numbers above, and the page cache grows back.

Maybe this demonstrates what is probably happening:

==============================
#! /usr/bin/bash

cd /amd/claptrap/1/tmp

# Create 5 x 1000 x 1000 = 5 million small files (only on the first run).
if [ ! -d many-files ]; then
     mkdir -p many-files
     for i in $(seq -w 5); do
         mkdir many-files/$i
         for j in $(seq -w 1000); do
             mkdir -p many-files/$i/$j
             for k in $(seq -w 1000); do
                 touch many-files/$i/$j/$k
             done
         done
     done
fi

# Allocate one 600 GiB file, larger than the RAM of the machine.
test -e big-file.dat || fallocate -l $((600*1024*1024*1024)) big-file.dat

# Start from a cold cache: drop the page cache as well as dentries and inodes.
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

echo "# Start:"
grep -E "^(MemTotal|MemFree|Cached|Active\(file\)|Inactive\(file\)|Slab):" /proc/meminfo
sudo grep xfs_inode /proc/slabinfo

find many-files -ls > /dev/null

echo "# After walking many files :"
grep -E "^(MemTotal|MemFree|Cached|Active\(file\)|Inactive\(file\)|Slab):" /proc/meminfo
sudo grep xfs_inode /proc/slabinfo

cat big-file.dat > /dev/null
echo "# After reading big file:"
grep -E "^(MemTotal|MemFree|Cached|Active\(file\)|Inactive\(file\)|Slab):" /proc/meminfo
sudo grep xfs_inode /proc/slabinfo
==============================

Output:

# Start:
MemTotal:       98634372 kB
MemFree:        97586092 kB
Cached:           115184 kB
Active(file):     100992 kB
Inactive(file):     8984 kB
Slab:             334300 kB
xfs_inode           1329   2272    960    4    1 : tunables   54   27    8 : slabdata    568    568    333
# After walking many files :
MemTotal:       98634372 kB
MemFree:        88795708 kB
Cached:           138024 kB
Active(file):     106740 kB
Inactive(file):    28176 kB
Slab:            6445960 kB
xfs_inode         5006003 5006008    960    4    1 : tunables   54   27    8 : slabdata 1251502 1251502      0
# After reading big file:
MemTotal:       98634372 kB
MemFree:          495240 kB
Cached:         95767564 kB
Active(file):     109404 kB
Inactive(file): 95655164 kB
Slab:            1693884 kB
xfs_inode          67714  68324    960    4    1 : tunables   54   27    8 : slabdata  17081  17081    243

So reading just one single file, which is bigger than the memory of the system, pulls the file data through the page cache, shrinks the slabs along the way, and the valuable VFS caches are lost. Instead, the memory is filled with the tail of the big file, which would not even be helpful if the file were read again.

> In the discussions [2], adjusting `vfs_cache_pressure` is recommended, but – besides setting it to 0 – it only seems to delay the shrinking of the cache. (As it is an integer, 1 is the lowest positive value, which would only delay the shrinking by a factor of 100.)
> 
> Is there a way to specify the minimum number of entries in the inode cache, or a minimum SLAB size below which the caches should not be shrunk?

Or limit the page cache.

There was an attempt to make that possible [1], but it looks like it didn't get anywhere.

[1]: https://lwn.net/Articles/602424/

Best

   Donald

> Kind regards,
> 
> Paul
> 
> 
> [1]: https://xfs.org/index.php/Runtime_Stats#ig
> [2]: https://linux-xfs.oss.sgi.narkive.com/qa0AYeBS/improving-xfs-file-system-inode-performance
>       "Improving XFS file system inode performance" from 2010

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Minimum inode cache size? (was: Slow file operations on file server with 30 TB hardware RAID and 100 TB software RAID)
  2021-08-26 10:41   ` Minimum inode cache size? (was: Slow file operations on file server with 30 TB hardware RAID and 100 TB software RAID) Paul Menzel
  2021-08-26 16:49     ` Donald Buczek
@ 2021-08-26 21:53     ` Dave Chinner
  1 sibling, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2021-08-26 21:53 UTC (permalink / raw)
  To: Paul Menzel; +Cc: LKML, it+linux-xfs, linux-fsdevel, linux-xfs, linux-mm

On Thu, Aug 26, 2021 at 12:41:25PM +0200, Paul Menzel wrote:
> Dear Linux folks,
> > > The current explanation is that overnight several maintenance
> > > scripts, like backup/mirroring and accounting scripts, are run, which
> > > touch all files on the devices. Additionally, other users sometimes
> > > run cluster jobs with millions of files on the software RAID. Such
> > > things invalidate the inode cache, and “my” inodes are thrown out.
> > > When I use the directory afterward, it is slow in the beginning.
> > > There is still free memory during these times according to `top`.

Yup. Your inodes are not in use, so they get cycled out of memory
for other inodes that are in active use.

> >      $ free -h
> >                    total        used        free      shared  buff/cache available
> >      Mem:            94G        8.3G        5.3G        2.3M         80G       83G
> >      Swap:            0B          0B          0B
> > 
> > > Does that sound reasonable with ten million inodes? Is that easily
> > > verifiable?
> > 
> > If an inode consumes 512 bytes, ten million inodes would be around
> > 5 GB, which should easily fit into memory, so the cache should not
> > need to be invalidated?
> 
> Something is wrong with that calculation, and the cache size is much bigger.

Inode size on disk != inode size in memory. Typically a clean XFS
inode in memory takes up ~1.1kB, regardless of on-disk size. An
inode that has been dirtied takes about 1.7kB.
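
A rough lower bound for the in-memory footprint can be read straight
from /proc/slabinfo by multiplying active objects by object size
(assuming the standard layout: name, active_objs, num_objs, objsize, ...):

    $ sudo awk '$1 == "xfs_inode" { printf "%s: %.1f GiB (%d objects x %d bytes)\n",
                $1, $2 * $4 / 2^30, $2, $4 }' /proc/slabinfo
    xfs_inode: 1.5 GiB (1693683 objects x 960 bytes)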

> Looking into `/proc/slabinfo` and XFS’ runtime/internal statistics [1], it
> turns out that the inode cache is likely the problem.
> 
> XFS’ internal stats show that only one third of the inode requests are
> answered from cache.
> 
>     $ grep ^ig /sys/fs/xfs/stats/stats
>     ig 1791207386 647353522 20111 1143854223 394 1142080045 10683174

Pretty normal for a machine that has diverse workloads, large data
sets and fairly constant memory pressure...

> During the problematic time, the SLAB size is around 4 GB and, according to
> slabinfo, the inode cache only has around 200,000 entries (sometimes even
> as low as 50,000).

Yup, that indicates the workload that has been running has been
generating either user space or page cache memory pressure, not
inode cache memory pressure. As a result, memory reclaim has
reclaimed the unused inode caches. This is how things are supposed
to work - the kernel adjusts its memory usage according to what is
consuming memory at the time there is memory demand.

>     $ sudo grep inode /proc/slabinfo
>     nfs_inode_cache       16     24   1064    3    1 : tunables   24   12    8 : slabdata      8      8      0
>     rpc_inode_cache       94    138    640    6    1 : tunables   54   27    8 : slabdata     23     23      0
>     mqueue_inode_cache      1      4    896    4    1 : tunables   54   27    8 : slabdata      1      1      0
>     xfs_inode         1693683 1722284    960    4    1 : tunables   54   27    8 : slabdata 430571 430571      0
>     ext2_inode_cache       0      0    768    5    1 : tunables   54   27    8 : slabdata      0      0      0
>     reiser_inode_cache      0      0    760    5    1 : tunables   54   27    8 : slabdata      0      0      0
>     hugetlbfs_inode_cache      2     12    608    6    1 : tunables   54   27    8 : slabdata      2      2      0
>     sock_inode_cache     346    670    768    5    1 : tunables   54   27    8 : slabdata    134    134      0
>     proc_inode_cache     121    288    656    6    1 : tunables   54   27    8 : slabdata     48     48      0
>     shmem_inode_cache   2249   2827    696   11    2 : tunables   54   27    8 : slabdata    257    257      0
>     inode_cache       209098 209482    584    7    1 : tunables   54   27    8 : slabdata  29926  29926      0
> 
> (What is the difference between `xfs_inode` and `inode_cache`?)

"inode_cache" is the generic inode slab cache used for things like
/proc and other VFS level psuedo filesytems. "xfs_inode_cache" is
the inodes used by XFS.

> Then going through all the files with `find -ls`, the inode cache grows to
> four to five million and the SLAB size grows to around 8 GB. Over night it
> shrinks back to the numbers above and the page cache grows back.

Yup, that's caching all the inodes the find traverses because it is
accessing the inodes and not just reading the directory structure.
There's likely 4-5 million inodes in that directory structure.

This is normal - the kernel is adjusting its memory usage according
to the workload that is currently running. However, if you don't
access those inodes again, and the system is put under memory
pressure, they'll get reclaimed and the memory used for whatever is
demanding memory at that point in time.

Again, this is normal behaviour for machines with multiple discrete,
diverse workloads with individual data sets and memory demand that,
in aggregate, are larger than the machine has the memory to hold. At
some point, we have to give back kernel memory so the current
application and data set can run efficiently from RAM...

> In the discussions [2], adjusting `vfs_cache_pressure` is recommended, but –
> besides setting it to 0 – it only seems to delay the shrinking of the cache.
> (As it is an integer, 1 is the lowest positive value, which would only delay
> the shrinking by a factor of 100.)

That's exactly what vfs_cache_pressure is intended to do - you can
slow down the reclaim of inodes and dentries, but if you have enough
memory demand for long enough, it will not prevent inodes that have
not been accessed for hours from being reclaimed.

Of course, setting it to zero is also behaving as expected - that
prevents memory reclaim from reclaiming dentries and inodes and
other filesystem caches. This is absolutely not recommended as it
can result in all of memory being filled with filesystem caches and
the system can then OOM in unrecoverable ways because the memory
held in VFS caches cannot be reclaimed.

> Is there a way to specify the minimum number of entries in the inode cache,
> or a minimum SLAB size below which the caches should not be shrunk?

You have a workload resource control problem, not an inode cache
problem. This is a problem control groups are intended to solve. For
controlling memory usage behaviour of workloads, see:

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory
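
For illustration, a minimal sketch (group name, limits and script name
are only placeholders) of confining the nightly maintenance jobs with
cgroup v2 so their memory and page cache demand cannot push everything
else out:

    # assumes a unified cgroup v2 hierarchy mounted at /sys/fs/cgroup
    sudo mkdir /sys/fs/cgroup/maintenance
    echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
    # throttle the group above 16G, hard-limit it at 20G (example values)
    echo 16G | sudo tee /sys/fs/cgroup/maintenance/memory.high
    echo 20G | sudo tee /sys/fs/cgroup/maintenance/memory.max
    # move the current shell into the group and run the nightly job there
    echo $$ | sudo tee /sys/fs/cgroup/maintenance/cgroup.procs
    ./nightly-backup.sh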

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-08-26 21:53 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-20 14:31 Slow file operations on file server with 10 TB hardware RAID and 100 TB software RAID Paul Menzel
2021-08-20 14:39 ` Paul Menzel
2021-08-26 10:41   ` Minimum inode cache size? (was: Slow file operations on file server with 30 TB hardware RAID and 100 TB software RAID) Paul Menzel
2021-08-26 16:49     ` Donald Buczek
2021-08-26 21:53     ` Dave Chinner
