* DAX 2MB mappings for XFS
@ 2018-01-12 19:40 ` Kani, Toshi
  0 siblings, 0 replies; 14+ messages in thread
From: Kani, Toshi @ 2018-01-12 19:40 UTC (permalink / raw)
  To: ross.zwisler, linux-nvdimm, david; +Cc: linux-fsdevel

Hello,

I noticed that DAX 2MB mmap no longer works on XFS.  I used the
following steps on a 4.15-rc7 kernel.  Am I missing something, or is
there a problem in XFS?

# mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
# mount -o dax /dev/pmem0 /mnt/pmem0
# xfs_io -c "extsize 2m" /mnt/pmem0

fio with libpmem engine (which uses mmap) is slow since it gets
serialized by 4KB page faults.

# numactl --cpunodebind=0 --membind=0 fio --filename=/mnt/pmem0/testfile \
    --rw=read --ioengine=libpmem --iodepth=1 --numjobs=16 --runtime=60 \
    --group_reporting --name=perf_test --thread=1 --size=6g --bs=128k \
    --direct=1
  :
Run status group 0 (all jobs):
   READ: bw=4357MiB/s (4569MB/s), 4357MiB/s-4357MiB/s (4569MB/s-4569MB/s), io=96.0GiB (103GB), run=22560-22560msec
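
To put the 4KB-versus-2MB fault granularity into numbers, here is a rough sketch of how many page faults one pass over the 6 GiB test file would take at each page size. It assumes one fault per page of the mapping and ignores fault-around and the fact that each of the 16 jobs has its own mapping, so treat it as an order-of-magnitude illustration rather than a model of the fio run. The helper name `faults_for_one_pass` is made up for this example.

```python
# Rough fault-count arithmetic for one pass over an mmap'ed file.
# Assumption: exactly one fault per page of the mapping, no fault-around.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def faults_for_one_pass(file_bytes, page_bytes):
    """Number of page faults needed to touch every byte once (rounded up)."""
    return -(-file_bytes // page_bytes)


FILE_BYTES = 6 * GIB  # --size=6g in the fio command

small = faults_for_one_pass(FILE_BYTES, 4 * KIB)
large = faults_for_one_pass(FILE_BYTES, 2 * MIB)

print(f"4KB faults: {small}")
print(f"2MB faults: {large}")
print(f"ratio:      {small // large}x")
```

The 4KB count equals the "1572864 blocks of 4096 bytes" that filefrag reports for the same file.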

The resulting file blocks in "testfile" are not 2MB-aligned.

# filefrag -v /mnt/pmem0/testfile
Filesystem type is: 58465342
File size of testfile is 6442450944 (1572864 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  261111:        520..    261631: 261112:
   1:   261112..  261348:         12..       248:    237:     261632:
   2:   261349..  522705:     261644..    523000: 261357:        249:
   3:   522706..  784062:     523276..    784632: 261357:     523001:
   4:   784063.. 1045419:     784908..   1046264: 261357:     784633:
   5:  1045420.. 1304216:    1049100..   1307896: 258797:    1046265:
   6:  1304217.. 1565573:    1308172..   1569528: 261357:    1307897:
   7:  1565574.. 1572863:    1570304..   1577593:   7290:    1569529: last,eof
testfile: 8 extents found

A file created by fallocate also shows a physical offset starting at
block 520, which is not 2MB-aligned.

# fallocate --length 1G /mnt/pmem0/data
# filefrag -v /mnt/pmem0/data
Filesystem type is: 58465342
File size of /mnt/pmem0/data is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  260607:        520..    261127: 260608:             unwritten
   1:   260608..  262143:     262144..    263679:   1536:     261128: last,unwritten,eof
/mnt/pmem0/data: 2 extents found

ext4 does not show the issue with the same steps.

# mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem1
# mount -o dax /dev/pmem1 /mnt/pmem1
# numactl --cpunodebind=0 --membind=0 fio --filename=/mnt/pmem1/testfile \
    --rw=read --ioengine=libpmem --iodepth=1 --numjobs=16 --runtime=60 \
    --group_reporting --name=perf_test --thread=1 --size=6g --bs=128k \
    --direct=1
  :
Run status group 0 (all jobs):
   READ: bw=44.4GiB/s (47.7GB/s), 44.4GiB/s-44.4GiB/s (47.7GB/s-47.7GB/s), io=96.0GiB (103GB), run=2160-2160msec

All blocks are 2MB-aligned.

# filefrag -v /mnt/pmem1/testfile
Filesystem type is: ef53
File size of testfile is 6442450944 (1572864 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   32767:      34816..     67583:  32768:
   1:    32768..   63487:      67584..     98303:  30720:
   2:    63488..   96255:     100352..    133119:  32768:      98304:
   3:    96256..  126975:     133120..    163839:  30720:
    :

# fallocate --length 1G /mnt/pmem1/data
# filefrag -v /mnt/pmem1/data
Filesystem type is: ef53
File size of /mnt/pmem1/data is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   30719:      34816..     65535:  30720:   unwritten
   1:    30720..   61439:      65536..     96255:  30720:   unwritten
   2:    61440..   63487:      96256..     98303:   2048:   unwritten
   :
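
The alignment claims in the filefrag listings above can be checked with a little arithmetic. The sketch below assumes that a 2MB DAX mapping needs every 2MB-aligned file offset in an extent to land on a 2MB-aligned physical block, which with 4096-byte blocks means the physical and logical start blocks must differ by a multiple of 512. It does not check that the extent is long enough or that the device itself starts on a 2MB boundary, and the function name `can_map_2mb` is invented for this example.

```python
BLOCK_SIZE = 4096
BLOCKS_PER_2MB = (2 * 1024 * 1024) // BLOCK_SIZE  # 512 blocks per 2MB


def can_map_2mb(logical_block, physical_block):
    """True if 2MB-aligned file offsets in this extent map to
    2MB-aligned physical blocks (extent length is not checked)."""
    return (physical_block - logical_block) % BLOCKS_PER_2MB == 0


# (logical start, physical start) of the first extents in each listing
xfs_extents = [(0, 520), (261112, 12), (261349, 261644), (522706, 523276)]
ext4_extents = [(0, 34816), (32768, 67584), (63488, 100352), (96256, 133120)]

for name, extents in (("xfs", xfs_extents), ("ext4", ext4_extents)):
    for logical, physical in extents:
        status = "ok" if can_map_2mb(logical, physical) else "misaligned"
        print(f"{name}: logical {logical:>7} -> physical {physical:>7}: {status}")
```

Under that assumption none of the listed XFS extents qualifies, while all four ext4 extents do.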

Thanks,
-Toshi
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

* Re: DAX 2MB mappings for XFS
  2018-01-12 19:40 ` Kani, Toshi
@ 2018-01-12 21:19   ` Dave Chinner
  -1 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2018-01-12 21:19 UTC (permalink / raw)
  To: Kani, Toshi; +Cc: linux-fsdevel, linux-nvdimm

On Fri, Jan 12, 2018 at 07:40:25PM +0000, Kani, Toshi wrote:
> Hello,
> 
> I noticed that DAX 2MB mmap no longer works on XFS.  I used the
> following steps on a 4.15-rc7 kernel.  Am I missing something, or is
> there a problem in XFS?
> 
> # mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
> # mount -o dax /dev/pmem0 /mnt/pmem0
> # xfs_io -c "extsize 2m" /mnt/pmem0
> 
> fio with libpmem engine (which uses mmap) is slow since it gets
> serialized by 4KB page faults.
> 
> # numactl --cpunodebind=0 --membind=0 fio --filename=/mnt/pmem0/testfile \
>     --rw=read --ioengine=libpmem --iodepth=1 --numjobs=16 --runtime=60 \
>     --group_reporting --name=perf_test --thread=1 --size=6g --bs=128k \
>     --direct=1
>   :
> Run status group 0 (all jobs):
>    READ: bw=4357MiB/s (4569MB/s), 4357MiB/s-4357MiB/s (4569MB/s-4569MB/s), io=96.0GiB (103GB), run=22560-22560msec
> 
> Resulted file blocks in "testfile" are not aligned by 2MB.
> 
> # filefrag -v /mnt/pmem0/testfile
> Filesystem type is: 58465342
> File size of testfile is 6442450944 (1572864 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..  261111:        520..    261631: 261112:
>    1:   261112..  261348:         12..       248:    237:     261632:
>    2:   261349..  522705:     261644..    523000: 261357:        249:
>    3:   522706..  784062:     523276..    784632: 261357:     523001:
>    4:   784063.. 1045419:     784908..   1046264: 261357:     784633:
>    5:  1045420.. 1304216:    1049100..   1307896: 258797:    1046265:
>    6:  1304217.. 1565573:    1308172..   1569528: 261357:    1307897:
>    7:  1565574.. 1572863:    1570304..   1577593:   7290:    1569529: last,eof
> testfile: 8 extents found
> 
> A file created by fallocate also shows that physical offset starts from
> 520, which is not aligned by 2MB. 
> 
> # fallocate --length 1G /mnt/pmem0/data
> # filefrag -v /mnt/pmem0/data
> Filesystem type is: 58465342
> File size of /mnt/pmem0/data is 1073741824 (262144 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..  260607:        520..    261127: 260608:             unwritten
>    1:   260608..  262143:     262144..    263679:   1536:     261128: last,unwritten,eof
> /mnt/pmem0/data: 2 extents found

/me really dislikes filefrag output.

$ sudo xfs_bmap -vvp /mnt/scratch/data
/mnt/scratch/data:
 EXT: FILE-OFFSET         BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..2088959]:       4160..2093119     0 (4160..2093119)  2088960 011111
   1: [2088960..2097151]: 2101248..2109439  1 (4096..12287)       8192 010000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end   on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end   on stripe width

Yeah, thought so. The bmap output clearly tells me that the
allocation being asked for doesn't fit into a single AG, so it's
trimmed to fit.
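
For readers less used to xfs_bmap output, the FLAGS column above can be decoded mechanically. This is a minimal sketch that treats the column as an octal bitmask, which is how the "FLAG Values" legend reads; the table of masks is copied from that legend, and `decode_flags` is a hypothetical helper, not part of xfsprogs.

```python
# Masks copied from the "FLAG Values" legend printed by xfs_bmap -vvp.
FLAG_LEGEND = [
    (0o100000, "Shared extent"),
    (0o010000, "Unwritten preallocated extent"),
    (0o001000, "Doesn't begin on stripe unit"),
    (0o000100, "Doesn't end on stripe unit"),
    (0o000010, "Doesn't begin on stripe width"),
    (0o000001, "Doesn't end on stripe width"),
]


def decode_flags(field):
    """Decode a FLAGS column value such as '011111' into legend entries."""
    value = int(field, 8)  # assumption: the column is printed in octal
    return [name for mask, name in FLAG_LEGEND if value & mask]


for field in ("011111", "010000"):
    print(field, "->", ", ".join(decode_flags(field)))
```

Decoded this way, extent 0 (011111) is unwritten and misses both the stripe unit and the stripe width at its start and end, while extent 1 (010000) is unwritten with no alignment complaints.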

To confirm this is the issue, let's do two smaller allocations:

$ sudo rm /mnt/scratch/data
dave@test4:~$ sudo xfs_io -f -c "falloc 0 512m" -c "falloc 512m 512m" -c stat -c "bmap -vvp" /mnt/scratch/data
fd.path = "/mnt/scratch/data"
fd.flags = non-sync,non-direct,read-write
stat.ino = 4099
stat.type = regular file
stat.size = 1073741824
stat.blocks = 2097152
fsxattr.xflags = 0x802 [-p--------e------]
fsxattr.projid = 0
fsxattr.extsize = 2097152
fsxattr.cowextsize = 0
fsxattr.nextents = 2
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
/mnt/scratch/data:
 EXT: FILE-OFFSET         BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..1048575]:       8192..1056767     0 (8192..1056767)  1048576 010000
   1: [1048576..2097151]: 2101248..3149823  1 (4096..1052671)  1048576 010000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end   on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end   on stripe width

Yup, all blocks are 2MB aligned.

IOWs, what you are seeing is trying to do a very large allocation on
a very small (8GB) XFS filesystem.  It's rare someone asks to
allocate >25% of the filesystem space in one allocation, so it's not
surprising it triggers ENOSPC-like algorithms because it doesn't fit
into a single AG....
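
The numbers in the two bmap listings back this up. xfs_bmap reports offsets in 512-byte units, so 2MB is 4096 of them. The sketch below checks the extent start offsets for 2MB alignment and infers the AG size from the second extent, assuming that AGs are equal-sized and that AG 0 starts at offset 0. The function names are invented for this example.

```python
SECTOR_BYTES = 512
SECTORS_PER_2MB = (2 * 1024 * 1024) // SECTOR_BYTES  # 4096


def is_2mb_aligned(sector):
    """True if a 512-byte-unit offset sits on a 2MB boundary."""
    return sector % SECTORS_PER_2MB == 0


def inferred_ag_bytes(fs_start_sector, ag_number, ag_offset_sector):
    """AG size implied by an extent's filesystem offset and AG offset.
    Assumes equal-sized AGs with AG 0 starting at offset 0."""
    if ag_number == 0:
        raise ValueError("an extent in AG 0 says nothing about the AG size")
    return (fs_start_sector - ag_offset_sector) // ag_number * SECTOR_BYTES


single_falloc_start = 4160            # extent 0 of the single 1G fallocate
split_falloc_starts = [8192, 2101248] # extents of the two 512m fallocs

ag_bytes = inferred_ag_bytes(2101248, 1, 4096)
request_bytes = 1 * 1024**3           # fallocate --length 1G

print("single falloc aligned:", is_2mb_aligned(single_falloc_start))
print("split fallocs aligned:", [is_2mb_aligned(s) for s in split_falloc_starts])
print("inferred AG size (MiB):", ag_bytes // 2**20)
print("request fills a whole AG:", request_bytes >= ag_bytes)
```

On these assumptions the AG works out to 1 GiB, the same size as the request, so the request cannot fit once anything at all occupies the start of the AG.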

We can probably look to optimise this, but I'm not sure if we can
easily differentiate this case (i.e. allocation request larger than
contiguous free space) from the same situation near ENOSPC when we
really do have to trim to fit...

Remember: stripe unit allocation alignment is a hint in XFS that we
can and do ignore when necessary - it's not a binding rule.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: DAX 2MB mappings for XFS
  2018-01-12 21:19   ` Dave Chinner
@ 2018-01-12 21:38     ` Kani, Toshi
  -1 siblings, 0 replies; 14+ messages in thread
From: Kani, Toshi @ 2018-01-12 21:38 UTC (permalink / raw)
  To: david; +Cc: linux-fsdevel, linux-nvdimm

On Sat, 2018-01-13 at 08:19 +1100, Dave Chinner wrote:
 :
> IOWs, what you are seeing is trying to do a very large allocation on
> a very small (8GB) XFS filesystem.  It's rare someone asks to
> allocate >25% of the filesystem space in one allocation, so it's not
> surprising it triggers ENOSPC-like algorithms because it doesn't fit
> into a single AG....
> 
> We can probably look to optimise this, but I'm not sure if we can
> easily differentiate this case (i.e. allocation request larger than
> contiguous free space) from the same situation near ENOSPC when we
> really do have to trim to fit...
> 
> Remember: stripe unit allocation alignment is a hint in XFS that we
> can and do ignore when necessary - it's not a binding rule.

Thanks for the clarification!  Can XFS allocate smaller extents so that
each extent fits within an AG?  ext4 creates multiple smaller extents
for the same request.

-Toshi

* Re: DAX 2MB mappings for XFS
  2018-01-12 21:38     ` Kani, Toshi
@ 2018-01-12 22:27       ` Dave Chinner
  -1 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2018-01-12 22:27 UTC (permalink / raw)
  To: Kani, Toshi; +Cc: linux-fsdevel, linux-nvdimm

On Fri, Jan 12, 2018 at 09:38:22PM +0000, Kani, Toshi wrote:
> On Sat, 2018-01-13 at 08:19 +1100, Dave Chinner wrote:
>  :
> > IOWs, what you are seeing is trying to do a very large allocation on
> > a very small (8GB) XFS filesystem.  It's rare someone asks to
> > allocate >25% of the filesystem space in one allocation, so it's not
> > surprising it triggers ENOSPC-like algorithms because it doesn't fit
> > into a single AG....
> > 
> > We can probably look to optimise this, but I'm not sure if we can
> > easily differentiate this case (i.e. allocation request larger than
> > > contiguous free space) from the same situation near ENOSPC when we
> > really do have to trim to fit...
> > 
> > Remember: stripe unit allocation alignment is a hint in XFS that we
> > can and do ignore when necessary - it's not a binding rule.
> 
> Thanks for the clarification!  Can XFS allocate smaller extents so that
> each extent will fit to an AG?

I've already answered that question:

	I'm not sure if we can easily differentiate this case (i.e.
	allocation request larger than contiguous free space) from
	the same situation near ENOSPC when we really do have to
	trim to fit...

> ext4 creates multiple smaller extents for the same request.

Yes, because it has much, much smaller block groups so "allocation >
max extent size (128MB)" is a common path.

It's not a common path on XFS - filesystems (and hence AGs) are
typically orders of magnitude larger than the maximum extent size
(8GB) so the problem only shows up when we're near ENOSPC. XFS is
really not optimised for tiny filesystems, and when it comes to pmem
we were led to believe we'd have multiple terabytes of pmem in
systems by now, not still be stuck with 8GB NVDIMMS. Hence we've
spent very little time worrying about such issues because we
weren't aiming to support such small capacities for very long...
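
The two maximum extent sizes quoted above follow from the extent length fields, as a quick calculation shows. The constants here are stated from memory of the on-disk formats and should be read as assumptions: a 15-bit limit of 32768 blocks for an initialized ext4 extent, and a 21-bit block count for an XFS extent, both with 4096-byte blocks.

```python
BLOCK_BYTES = 4096

# Assumed limits, in filesystem blocks (not taken from this thread):
EXT4_MAX_EXTENT_BLOCKS = 1 << 15        # initialized ext4 extent
XFS_MAX_EXTENT_BLOCKS = (1 << 21) - 1   # 21-bit XFS extent length field

ext4_max_bytes = EXT4_MAX_EXTENT_BLOCKS * BLOCK_BYTES
xfs_max_bytes = XFS_MAX_EXTENT_BLOCKS * BLOCK_BYTES

print(f"ext4 max extent: {ext4_max_bytes / 2**20:.0f} MiB")
print(f"XFS  max extent: {xfs_max_bytes / 2**30:.2f} GiB")
print(f"ratio:           {xfs_max_bytes // ext4_max_bytes}x")
```

With those limits a multi-gigabyte request always has to be split on ext4, whereas on XFS it is normally satisfied by a single extent unless the AG is too small or too full.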

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: DAX 2MB mappings for XFS
  2018-01-12 22:27       ` Dave Chinner
@ 2018-01-12 23:15         ` Kani, Toshi
  -1 siblings, 0 replies; 14+ messages in thread
From: Kani, Toshi @ 2018-01-12 23:15 UTC (permalink / raw)
  To: david; +Cc: linux-fsdevel, linux-nvdimm

On Sat, 2018-01-13 at 09:27 +1100, Dave Chinner wrote:
> On Fri, Jan 12, 2018 at 09:38:22PM +0000, Kani, Toshi wrote:
> > On Sat, 2018-01-13 at 08:19 +1100, Dave Chinner wrote:
> >  :
> > > IOWs, what you are seeing is trying to do a very large allocation on
> > > a very small (8GB) XFS filesystem.  It's rare someone asks to
> > > allocate >25% of the filesystem space in one allocation, so it's not
> > > surprising it triggers ENOSPC-like algorithms because it doesn't fit
> > > into a single AG....
> > > 
> > > We can probably look to optimise this, but I'm not sure if we can
> > > easily differentiate this case (i.e. allocation request larger than
> > > > contiguous free space) from the same situation near ENOSPC when we
> > > really do have to trim to fit...
> > > 
> > > Remember: stripe unit allocation alignment is a hint in XFS that we
> > > can and do ignore when necessary - it's not a binding rule.
> > 
> > Thanks for the clarification!  Can XFS allocate smaller extents so that
> > each extent will fit to an AG?
> 
> I've already answered that question:
> 
> 	I'm not sure if we can easily differentiate this case (i.e.
> 	allocation request larger than contiguous free space) from
> 	the same situation near ENOSPC when we really do have to
> 	trim to fit...

Right.  I was thinking of limiting the extent size (e.g. to a half or a
quarter of the AG size) regardless of the ENOSPC condition, but that may
be the same thing.

> > ext4 creates multiple smaller extents for the same request.
> 
> Yes, because it has much, much smaller block groups so "allocation >
> max extent size (128MB)" is a common path.
> 
> It's not a common path on XFS - filesystems (and hence AGs) are
> typically orders of magnitude larger than the maximum extent size
> (8GB) so the problem only shows up when we're near ENOSPC. XFS is
> really not optimised for tiny filesystems, and when it comes to pmem
> we were led to believe we'd have multiple terabytes of pmem in
> systems by now, not still be stuck with 8GB NVDIMMS. Hence we've
> spent very little time worrying about such issues because we
> weren't aiming to support such small capacities for very long...

I see.  Yes, there will be multiple terabytes of capacity, but it will
also be possible to divide it into multiple smaller namespaces.  So users
may continue to have relatively small namespaces for their use cases.  If
a user allocates a namespace that is just big enough to host several
active files, it may hit this issue regardless of their size.

Thanks,
-Toshi

* Re: DAX 2MB mappings for XFS
  2018-01-12 23:15         ` Kani, Toshi
@ 2018-01-12 23:52           ` Darrick J. Wong
  -1 siblings, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2018-01-12 23:52 UTC (permalink / raw)
  To: Kani, Toshi; +Cc: linux-fsdevel, david, linux-nvdimm

On Fri, Jan 12, 2018 at 11:15:00PM +0000, Kani, Toshi wrote:
> On Sat, 2018-01-13 at 09:27 +1100, Dave Chinner wrote:
> > On Fri, Jan 12, 2018 at 09:38:22PM +0000, Kani, Toshi wrote:
> > > On Sat, 2018-01-13 at 08:19 +1100, Dave Chinner wrote:
> > >  :
> > > > IOWs, what you are seeing is trying to do a very large allocation on
> > > > a very small (8GB) XFS filesystem.  It's rare someone asks to
> > > > allocate >25% of the filesystem space in one allocation, so it's not
> > > > surprising it triggers ENOSPC-like algorithms because it doesn't fit
> > > > into a single AG....
> > > > 
> > > > We can probably look to optimise this, but I'm not sure if we can
> > > > easily differentiate this case (i.e. allocation request larger than
> > > > contiguous free space) from the same situation near ENOSPC when we
> > > > really do have to trim to fit...
> > > > 
> > > > Remember: stripe unit allocation alignment is a hint in XFS that we
> > > > can and do ignore when necessary - it's not a binding rule.
> > > 
> > > Thanks for the clarification!  Can XFS allocate smaller extents so that
> > > each extent will fit in an AG?
> > 
> > I've already answered that question:
> > 
> > 	I'm not sure if we can easily differentiate this case (i.e.
> > 	allocation request larger than continguous free space) from
> > 	the same situation near ENOSPC when we really do have to
> > 	trim to fit...
> 
> Right.  I was thinking of limiting the extent size (e.g. to a half or a
> quarter of the AG size) regardless of the ENOSPC condition, but it may
> amount to the same thing.
> 
> > > ext4 creates multiple smaller extents for the same request.
> > 
> > Yes, because it has much, much smaller block groups so "allocation >
> > max extent size (128MB)" is a common path.
> > 
> > It's not a common path on XFS - filesystems (and hence AGs) are
> > typically orders of magnitude larger than the maximum extent size
> > (8GB) so the problem only shows up when we're near ENOSPC. XFS is
> > really not optimised for tiny filesystems, and when it comes to pmem
> > we were led to believe we'd have multiple terabytes of pmem in
> > systems by now, not still be stuck with 8GB NVDIMMs. Hence we've
> > spent very little time worrying about such issues because we
> > weren't aiming to support such small capacities for very long...
> 
> I see.  Yes, there will be multiple terabytes of capacity, but it will
> also be possible to divide it into multiple smaller namespaces.  So users
> may continue to have relatively small namespaces for their use cases.  If
> a user allocates a namespace that is just big enough to host several
> active files, it may hit this issue regardless of their size.

I am curious, why not just give XFS all the space and let it manage the space?

--D

> Thanks,
> -Toshi

* Re: DAX 2MB mappings for XFS
  2018-01-12 23:52           ` Darrick J. Wong
@ 2018-01-13  0:05             ` Kani, Toshi
  -1 siblings, 0 replies; 14+ messages in thread
From: Kani, Toshi @ 2018-01-13  0:05 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-fsdevel, david, linux-nvdimm

On Fri, 2018-01-12 at 15:52 -0800, Darrick J. Wong wrote:
> On Fri, Jan 12, 2018 at 11:15:00PM +0000, Kani, Toshi wrote:
 :
> > > > ext4 creates multiple smaller extents for the same request.
> > > 
> > > Yes, because it has much, much smaller block groups so "allocation >
> > > max extent size (128MB)" is a common path.
> > > 
> > > It's not a common path on XFS - filesystems (and hence AGs) are
> > > typically orders of magnitude larger than the maximum extent size
> > > (8GB) so the problem only shows up when we're near ENOSPC. XFS is
> > > really not optimised for tiny filesystems, and when it comes to pmem
> > > we were led to believe we'd have multiple terabytes of pmem in
> > > systems by now, not still be stuck with 8GB NVDIMMs. Hence we've
> > > spent very little time worrying about such issues because we
> > > weren't aiming to support such small capacities for very long...
> > 
> > I see.  Yes, there will be multiple terabytes of capacity, but it will
> > also be possible to divide it into multiple smaller namespaces.  So users
> > may continue to have relatively small namespaces for their use cases.  If
> > a user allocates a namespace that is just big enough to host several
> > active files, it may hit this issue regardless of their size.
> 
> I am curious, why not just give XFS all the space and let it manage the space?

Well, I am not sure whether multiple namespaces will be a popular use
case.  But they could be useful when a system hosts multiple guests or
containers that require isolation of storage space.

Thanks,
-Toshi

Thread overview:
2018-01-12 19:40 DAX 2MB mappings for XFS Kani, Toshi
2018-01-12 21:19 ` Dave Chinner
2018-01-12 21:38   ` Kani, Toshi
2018-01-12 22:27     ` Dave Chinner
2018-01-12 23:15       ` Kani, Toshi
2018-01-12 23:52         ` Darrick J. Wong
2018-01-13  0:05           ` Kani, Toshi