* ENOSPC on a 10% used disk
@ 2018-10-17  7:52 Avi Kivity
  2018-10-17  8:47 ` Christoph Hellwig
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-17  7:52 UTC (permalink / raw)
  To: linux-xfs

I have a user running a 1.7TB filesystem with ~10% usage (as shown by 
df), getting sporadic ENOSPC errors. The disk is mounted with inode64 
and has a relatively small number of large files. The disk is a 
single-member RAID0 array, with 1MB chunk size. There are 32 AGs. 
Running Linux 4.9.17.


The write load consists of AIO/DIO writes, followed by unlinks of these 
files. The writes are non-size-changing (we truncate ahead) and we use 
XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The errors 
happen on commit logs, which have a target size of 32MB (but may exceed 
it a little).
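
For reference, the same hint can also be set from the shell with xfs_io 
(a sketch; the path is a placeholder):

    # set a 32MB extent size hint, equivalent to the
    # XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE ioctl
    xfs_io -c "extsize 32m" /var/lib/scylla/commitlog/CommitLog.log
    xfs_io -c "extsize" /var/lib/scylla/commitlog/CommitLog.log   # read it back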


The errors are sporadic and after restarting the workload they go away 
for a few hours to a few days, but then return. During one of the 
crashes I used xfs_db to look at fragmentation and saw that most AGs had 
free extents of size categories up to 128-255, but a few had more. I 
tried xfs_fsr but it did not help.
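
The xfs_db check amounts to something like this (a sketch; the device 
name is assumed):

    # free space histograms; -r (read-only) is safe on a mounted fs
    xfs_db -r -c "freesp -s" /dev/md0         # whole filesystem
    xfs_db -r -c "freesp -s -a 0" /dev/md0    # per-AG, here AG 0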


Is this a known issue? Would upgrading the kernel help?


I'll try to get a metadata dump next time this happens, and I'll be 
happy to supply more information.


* Re: ENOSPC on a 10% used disk
  2018-10-17  7:52 ENOSPC on a 10% used disk Avi Kivity
@ 2018-10-17  8:47 ` Christoph Hellwig
  2018-10-17  8:57   ` Avi Kivity
  2018-10-18  1:37 ` Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-10-17  8:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df),
> getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a
> relatively small number of large files. The disk is a single-member RAID0
> array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.

4.9.17 is rather old and you'll have a hard time finding someone
familiar with it..

> Is this a known issue? Would upgrading the kernel help?

A few things that come to mind:

 - are you sure there is no open fd to the unlinked files?  That would
   keep the space allocated until the last link is dropped.
 - even once we drop the inode the space only becomes available once
   the transaction has committed.  We do force the log if we found
   a busy extent, but there might be some issues.  Try seeing if you
   hit the xfs_extent_busy_force trace point with your workload.
 - if you have online discard (-o discard) enabled there might be
   more issues like the above, especially on old kernels.
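
A quick way to check the first two points from userspace (a sketch; the 
mountpoint is an assumption, and the tracepoint is assumed to live under 
events/xfs/):

    # 1. any unlinked-but-still-open files pinning space?
    lsof +L1 /var/lib/scylla

    # 2. watch for busy-extent log forces
    cd /sys/kernel/debug/tracing
    echo 1 > events/xfs/xfs_extent_busy_force/enable
    cat trace_pipe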


* Re: ENOSPC on a 10% used disk
  2018-10-17  8:47 ` Christoph Hellwig
@ 2018-10-17  8:57   ` Avi Kivity
  2018-10-17 10:54     ` Avi Kivity
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-17  8:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs


On 17/10/2018 11.47, Christoph Hellwig wrote:
> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df),
>> getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a
>> relatively small number of large files. The disk is a single-member RAID0
>> array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.
> 4.9.17 is rather old and you'll have a hard time finding someone
> familiar with it..


Yes. I expect my user will agree to upgrade, but I'd like to recommend 
this only if we know there was a real issue and it was resolved, not on 
general principles.


>> Is this a known issue? Would upgrading the kernel help?
> A few things that come to mind:
>
>   - are you sure there is no open fd to the unlinked files?  That would
>     keep the space allocated until the last link is dropped.


"df" would report that space as occupied, no?


I believe a colleague verified there were no deleted files but I'm not 
100% sure.


>   - even once we drop the inode the space only becomes available once
>     the transaction has committed.  We do force the log if we found
>     a busy extent, but there might be some issues.  Try seeing if you
>     hit the xfs_extent_busy_force trace point with your workload.


I'll ask permission to check this and report.


>   - if you have online discard (-o discard) enabled there might be
>     more issues like the above, especially on old kernels.


Online discard is not enabled:


/dev/md0 on /var/lib/scylla type xfs 
(rw,noatime,attr2,inode64,sunit=2048,swidth=2048,noquota)

btw, we've seen fstrim on an old disk (that was likely never trimmed) 
improving its performance by a factor of ~100, so my interest in -o 
discard is re-awakening. Is it good enough now to run on aio 
workloads (assuming nvme) or is more work needed? My prime concern is to 
avoid io_submit sleeping.
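
For reference, the one-off trim amounts to something like this, and a 
periodic timer is the usual alternative to -o discard (mountpoint 
assumed):

    fstrim -v /var/lib/scylla             # trim all free space once
    systemctl enable --now fstrim.timer   # or trim on a weekly schedule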


* Re: ENOSPC on a 10% used disk
  2018-10-17  8:57   ` Avi Kivity
@ 2018-10-17 10:54     ` Avi Kivity
  0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-17 10:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs


On 17/10/2018 11.57, Avi Kivity wrote:
>
>>   - even once we drop the inode the space only becomes available once
>>     the transaction has committed.  We do force the log if we found
>>     a busy extent, but there might be some issues.  Try seeing if you
>>     hit the xfs_extent_busy_force trace point with your workload.
>
>
> I'll ask permission to check this and report.
>
>


An hour's tracing yielded zero hits. Of course, that says nothing about 
other times; I'll continue to trace.


* Re: ENOSPC on a 10% used disk
  2018-10-17  7:52 ENOSPC on a 10% used disk Avi Kivity
  2018-10-17  8:47 ` Christoph Hellwig
@ 2018-10-18  1:37 ` Dave Chinner
  2018-10-18  7:55   ` Avi Kivity
  2018-10-18 15:54 ` Eric Sandeen
  2019-02-05 21:48 ` Dave Chinner
  3 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-18  1:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown
> by df), getting sporadic ENOSPC errors. The disk is mounted with
> inode64 and has a relatively small number of large files. The disk
> is a single-member RAID0 array, with 1MB chunk size. There are 32
> AGs. Running Linux 4.9.17.

ENOSPC on what operation? write? open(O_CREAT)? something else?

What's the filesystem config (xfs_info output)?

> The write load consists of AIO/DIO writes, followed by unlinks of
> these files. The writes are non-size-changing (we truncate ahead)
> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
> 32MB. The errors happen on commit logs, which have a target size of
> 32MB (but may exceed it a little).
> 
> 
> The errors are sporadic and after restarting the workload they go
> away for a few hours to a few days, but then return. During one of
> the crashes I used xfs_db to look at fragmentation and saw that most
> AGs had free extents of size categories up to 128-255, but a few had
> more. I tried xfs_fsr but it did not help.

32MB extents are 8192 blocks. The bucket 128-255 records extents
between 512k and 1MB in size, so it sounds like free space has been
fragmented to death. Has xfs_fsr been run on this filesystem
regularly?

If the ENOSPC errors are only from files with a 32MB extent size
hints on them, then it may be that there isn't sufficient contiguous
free space to allocate an entire 32MB extent. I'm not sure what the
allocator behaviour here is (the code is a maze of twisty passages),
so I'll have to look more into this.

In the mean time, can you post the output of the freespace command
(both global and per-ag) so we can see just how much free space
there is and how badly fragmented it has become? I might be able to
reproduce the behaviour if I know the conditions under which it is
occurring.

> Is this a known issue? Would upgrading the kernel help?

Not that I know of. If it's an extszhint vs free space fragmentation
issue, then a kernel upgrade is unlikely to fix it.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: ENOSPC on a 10% used disk
  2018-10-18  1:37 ` Dave Chinner
@ 2018-10-18  7:55   ` Avi Kivity
  2018-10-18 10:05     ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-18  7:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 18/10/2018 04.37, Dave Chinner wrote:
> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>> inode64 and has a relatively small number of large files. The disk
>> is a single-member RAID0 array, with 1MB chunk size. There are 32
>> AGs. Running Linux 4.9.17.
> ENOSPC on what operation? write? open(O_CREAT)? something else?


Unknown.


> What's the filesystem config (xfs_info output)?


(restored from metadata dump)


meta-data=/dev/loop2             isize=512    agcount=32, 
agsize=14494720 blks
          =                       sectsz=512   attr=2, projid32bit=1
          =                       crc=1        finobt=0 spinodes=0 rmapbt=0
          =                       reflink=0
data     =                       bsize=4096   blocks=463831040, imaxpct=5
          =                       sunit=256    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=226480, version=2
          =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


>> The write load consists of AIO/DIO writes, followed by unlinks of
>> these files. The writes are non-size-changing (we truncate ahead)
>> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
>> 32MB. The errors happen on commit logs, which have a target size of
>> 32MB (but may exceed it a little).
>>
>>
>> The errors are sporadic and after restarting the workload they go
>> away for a few hours to a few days, but then return. During one of
>> the crashes I used xfs_db to look at fragmentation and saw that most
>> AGs had free extents of size categories up to 128-255, but a few had
>> more. I tried xfs_fsr but it did not help.
> 32MB extents are 8192 blocks. The bucket 128-255 records extents
> between 512k and 1MB in size, so it sounds like free space has been
> fragmented to death. Has xfs_fsr been run on this filesystem
> regularly?


xfs_fsr has never been run, until we saw the problem (and then did not 
fix it). IIUC the workload should be self-defragmenting: it consists of 
writing large files, then erasing them. I estimate that around 100 files 
are written concurrently (from 14 threads), and they are written with 
large extent hints. With every large file, another smaller (but still 
large) file is written, and a few smallish metadata files.


I understood from xfs_fsr that it attempts to defragment files, not free 
space, although that may come as a side effect. In any case I ran xfs_db 
after xfs_fsr and did not see an improvement.


>
> If the ENOSPC errors are only from files with a 32MB extent size
> hints on them, then it may be that there isn't sufficient contiguous
> free space to allocate an entire 32MB extent. I'm not sure what the
> allocator behaviour here is (the code is a maze of twisty passages),
> so I'll have to look more into this.


There are other files with 32MB hints that do not show the error (but on 
the other hand, the error has been observed few enough times for that to 
be a fluke).


>
> In the mean time, can you post the output of the freespace command
> (both global and per-ag) so we can see just how much free space
> there is and how badly fragmented it has become? I might be able to
> reproduce the behaviour if I know the conditions under which it is
> occurring.


xfs_db> freesp
    from      to extents  blocks    pct
       1       1    5916    5916   0.00
       2       3   10235   22678   0.01
       4       7   12251   66829   0.02
       8      15    5521   59556   0.01
      16      31    5703  132031   0.03
      32      63    9754  463825   0.11
      64     127   16742 1590339   0.37
     128     255 1550511 390108625  89.87
     256     511   71516 29178504   6.72
     512    1023      19   15355   0.00
    1024    2047     287  461824   0.11
    2048    4095     528 1611413   0.37
    4096    8191    1537 10352304   2.38
    8192   16383       2   19015   0.00

Just 2 extents >= 32MB (and they may have been freed after the error).


Per-ag:


    from      to extents  blocks    pct
       1       1     390     390   0.00
       2       3     542    1215   0.01
       4       7     590    3211   0.02
       8      15     265    2735   0.02
      16      31     219    5000   0.04
      32      63     323   15530   0.11
      64     127     620   58217   0.43
     128     255   48677 12254686  90.27
     256     511    2981 1234365   9.09
    from      to extents  blocks    pct
       1       1     542     542   0.00
       2       3     646    1495   0.01
       4       7     592    3122   0.02
       8      15     525    5937   0.04
      16      31     539   12280   0.09
      32      63     691   33226   0.25
      64     127     851   78277   0.59
     128     255   46390 11658684  88.21
     256     511    3335 1422955  10.77
    from      to extents  blocks    pct
       1       1     560     560   0.00
       2       3     642    1454   0.01
       4       7     483    2552   0.02
       8      15     368    4020   0.03
      16      31     440    9947   0.08
      32      63     540   25347   0.21
      64     127     733   67944   0.56
     128     255   42337 10632366  87.06
     256     511    3386 1438609  11.78
     512    1023       5    4423   0.04
    1024    2047       5    8649   0.07
    2048    4095       3    9205   0.08
    4096    8191       1    8191   0.07
    from      to extents  blocks    pct
       1       1     662     662   0.01
       2       3     675    1545   0.02
       4       7     490    2483   0.03
       8      15     414    4485   0.05
      16      31     445    9915   0.11
      32      63     540   25279   0.29
      64     127     683   63014   0.72
     128     255   10061 2483774  28.34
     256     511    1498  574685   6.56
     512    1023       9    6715   0.08
    1024    2047       5    6967   0.08
    2048    4095     100  354101   4.04
    4096    8191     786 5229818  59.68
    from      to extents  blocks    pct
       1       1     642     642   0.01
       2       3     705    1599   0.02
       4       7     545    2801   0.04
       8      15     407    4320   0.05
      16      31     410    9396   0.12
      32      63     513   24294   0.31
      64     127     528   48217   0.61
     128     255    2723  644939   8.17
     256     511     875  326064   4.13
     512    1023       5    4217   0.05
    1024    2047     277  446208   5.65
    2048    4095     425 1248107  15.81
    4096    8191     750 5114295  64.79
    8192   16383       2   19015   0.24
    from      to extents  blocks    pct
       1       1     176     176   0.00
       2       3     484    1228   0.01
       4       7     825    4277   0.03
       8      15      73     870   0.01
      16      31     174    4155   0.03
      32      63     356   16746   0.12
      64     127     597   58761   0.42
     128     255   55401 13814803  99.38
    from      to extents  blocks    pct
       1       1     182     182   0.00
       2       3     212     444   0.00
       4       7      32     188   0.00
       8      15      58     692   0.00
      16      31     102    2369   0.02
      32      63     243   11756   0.08
      64     127     449   43271   0.30
     128     255   53882 13618288  95.22
     256     511    1550  625387   4.37
    from      to extents  blocks    pct
       1       1     147     147   0.00
       2       3     203     426   0.00
       4       7     287    1585   0.01
       8      15      84     958   0.01
      16      31     105    2370   0.02
      32      63     243   12073   0.09
      64     127     497   47704   0.34
     128     255   51847 13080484  94.15
     256     511    1897  747986   5.38
    from      to extents  blocks    pct
       1       1      81      81   0.00
       2       3     129     262   0.00
       4       7     186    1070   0.01
       8      15     148    1781   0.01
      16      31     225    5411   0.04
      32      63     257   12226   0.09
      64     127     492   46230   0.33
     128     255   53802 13533984  95.16
     256     511    1574  621876   4.37
    from      to extents  blocks    pct
       1       1     159     159   0.00
       2       3     191     398   0.00
       4       7     182    1009   0.01
       8      15      63     730   0.01
      16      31      88    2006   0.01
      32      63     191    9044   0.06
      64     127     494   46669   0.33
     128     255   53441 13451913  94.51
     256     511    1850  720941   5.07
    from      to extents  blocks    pct
       1       1     156     156   0.00
       2       3     192     397   0.00
       4       7     169     948   0.01
       8      15      67     780   0.01
      16      31     115    2948   0.02
      32      63     272   12564   0.09
      64     127     511   49124   0.35
     128     255   53339 13427444  94.42
     256     511    1866  726347   5.11
    from      to extents  blocks    pct
       1       1     157     157   0.00
       2       3     171     364   0.00
       4       7     221    1215   0.01
       8      15      45     504   0.00
      16      31     116    2628   0.02
      32      63     249   11827   0.08
      64     127     474   47158   0.33
     128     255   53261 13409025  94.35
     256     511    1886  738689   5.20
    from      to extents  blocks    pct
       1       1     142     142   0.00
       2       3     181     395   0.00
       4       7     323    1753   0.01
       8      15     108    1176   0.01
      16      31     134    3069   0.02
      32      63     260   12055   0.08
      64     127     411   39107   0.28
     128     255   53197 13389340  94.39
     256     511    1877  737582   5.20
    from      to extents  blocks    pct
       1       1     137     137   0.00
       2       3     174     386   0.00
       4       7     222    1232   0.01
       8      15      93    1012   0.01
      16      31      96    2192   0.02
      32      63     223   10763   0.08
      64     127     493   47665   0.34
     128     255   53125 13374075  94.17
     256     511    1949  764710   5.38
    from      to extents  blocks    pct
       1       1      59      59   0.00
       2       3     138     309   0.00
       4       7     224    1217   0.01
       8      15     104    1211   0.01
      16      31     138    3352   0.02
      32      63     337   16480   0.12
      64     127     585   55922   0.39
     128     255   53654 13487724  95.05
     256     511    1589  623688   4.40
    from      to extents  blocks    pct
       1       1     121     121   0.00
       2       3     264     597   0.00
       4       7     706    3907   0.03
       8      15     174    1802   0.01
      16      31      94    2243   0.02
      32      63     228   10806   0.08
      64     127     495   47228   0.34
     128     255   52078 13106646  93.94
     256     511    1953  779417   5.59
    from      to extents  blocks    pct
       1       1     107     107   0.00
       2       3     174     370   0.00
       4       7     248    1401   0.01
       8      15     115    1318   0.01
      16      31     111    2561   0.02
      32      63     218   10243   0.07
      64     127     443   42493   0.30
     128     255   52320 13168357  94.43
     256     511    1828  717948   5.15
    from      to extents  blocks    pct
       1       1     126     126   0.00
       2       3     353     793   0.01
       4       7     774    4297   0.03
       8      15     174    1767   0.01
      16      31     129    3135   0.02
      32      63     317   14569   0.11
      64     127     506   48326   0.35
     128     255   51507 12956078  93.58
     256     511    2055  815607   5.89
    from      to extents  blocks    pct
       1       1     118     118   0.00
       2       3     207     448   0.00
       4       7     299    1694   0.01
       8      15      91     960   0.01
      16      31     104    2394   0.02
      32      63     358   17378   0.12
      64     127     497   47351   0.34
     128     255   52540 13229046  93.84
     256     511    1971  798192   5.66
    from      to extents  blocks    pct
       1       1     105     105   0.00
       2       3     261     571   0.00
       4       7     333    1851   0.01
       8      15     100    1009   0.01
      16      31     137    3323   0.02
      32      63     261   12069   0.09
      64     127     482   45103   0.32
     128     255   51909 13060192  93.20
     256     511    2226  889345   6.35
    from      to extents  blocks    pct
       1       1     111     111   0.00
       2       3     221     471   0.00
       4       7     243    1341   0.01
       8      15     101    1002   0.01
      16      31      87    2145   0.02
      32      63     265   12987   0.09
      64     127     429   41335   0.29
     128     255   51818 13031610  92.85
     256     511    2312  944418   6.73
    from      to extents  blocks    pct
       1       1      89      89   0.00
       2       3     245     542   0.00
       4       7     383    2114   0.02
       8      15     107    1117   0.01
      16      31     153    3505   0.03
      32      63     237   11431   0.08
      64     127     489   46582   0.33
     128     255   51377 12929850  92.48
     256     511    2412  986093   7.05
    from      to extents  blocks    pct
       1       1      83      83   0.00
       2       3     253     536   0.00
       4       7     341    1902   0.01
       8      15     118    1269   0.01
      16      31     137    3201   0.02
      32      63     235   11096   0.08
      64     127     432   41041   0.30
     128     255   51165 12882960  92.73
     256     511    2348  951207   6.85
    from      to extents  blocks    pct
       1       1      63      63   0.00
       2       3     263     570   0.00
       4       7     427    2392   0.02
       8      15     143    1536   0.01
      16      31     117    2714   0.02
      32      63     217   10510   0.08
      64     127     402   38021   0.27
     128     255   50857 12803884  91.91
     256     511    2583 1071722   7.69
    from      to extents  blocks    pct
       1       1      69      69   0.00
       2       3     302     645   0.00
       4       7     343    1884   0.01
       8      15     120    1234   0.01
      16      31     133    3184   0.02
      32      63     215    9971   0.07
      64     127     506   49464   0.35
     128     255   49778 12542384  89.34
     256     511    3333 1429372  10.18
    from      to extents  blocks    pct
       1       1      62      62   0.00
       2       3     300     652   0.00
       4       7     432    2413   0.02
       8      15     173    1814   0.01
      16      31      92    2119   0.02
      32      63     253   12006   0.09
      64     127     439   43006   0.31
     128     255   49809 12539975  89.53
     256     511    3298 1403687  10.02
    from      to extents  blocks    pct
       1       1      52      52   0.00
       2       3     283     608   0.00
       4       7     253    1382   0.01
       8      15     126    1353   0.01
      16      31     117    2653   0.02
      32      63     226   10856   0.08
      64     127     462   43181   0.31
     128     255   50799 12805008  90.86
     256     511    2899 1228715   8.72
    from      to extents  blocks    pct
       1       1      53      53   0.00
       2       3     322     683   0.00
       4       7     473    2658   0.02
       8      15     206    2134   0.02
      16      31     149    3494   0.03
      32      63     251   12271   0.09
      64     127     548   52541   0.38
     128     255   50353 12685959  91.22
     256     511    2753 1146454   8.24
    from      to extents  blocks    pct
       1       1      46      46   0.00
       2       3     309     655   0.00
       4       7     373    2108   0.02
       8      15     181    1951   0.01
      16      31     161    3795   0.03
      32      63     270   12433   0.09
      64     127     434   41689   0.30
     128     255   50963 12821420  91.99
     256     511    2604 1054433   7.56
    from      to extents  blocks    pct
       1       1     121     121   0.00
       2       3     357     779   0.01
       4       7     337    1825   0.01
       8      15     220    2378   0.02
      16      31     181    4124   0.03
      32      63     297   13987   0.10
      64     127     571   53694   0.39
     128     255   49880 12560088  91.06
     256     511    2792 1155483   8.38
    from      to extents  blocks    pct
       1       1     235     235   0.00
       2       3     439     964   0.01
       4       7     448    2445   0.02
       8      15     275    2842   0.02
      16      31     221    4979   0.04
      32      63     332   15967   0.12
      64     127     596   56251   0.41
     128     255   48484 12208089  89.11
     256     511    3341 1408614  10.28
    from      to extents  blocks    pct
       1       1     163     163   0.00
       2       3     397     877   0.01
       4       7     467    2552   0.02
       8      15     275    2859   0.02
      16      31     234    5424   0.04
      32      63     336   16035   0.12
      64     127     593   55753   0.41
     128     255   49737 12515550  91.40
     256     511    2695 1093913   7.99

>> Is this a known issue? Would upgrading the kernel help?
> Not that I know of. If it's an extszhint vs free space fragmentation
> issue, then a kernel upgrade is unlikely to fix it.
>
> Cheers,
>
> Dave.
>


* Re: ENOSPC on a 10% used disk
  2018-10-18  7:55   ` Avi Kivity
@ 2018-10-18 10:05     ` Dave Chinner
  2018-10-18 11:00       ` Avi Kivity
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-18 10:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

[ hmmm, there's some whacky utf-8 whitespace characters in the
 copy-n-pasted text... ]

On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
> 
> On 18/10/2018 04.37, Dave Chinner wrote:
> >On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> >>I have a user running a 1.7TB filesystem with ~10% usage (as shown
> >>by df), getting sporadic ENOSPC errors. The disk is mounted with
> >>inode64 and has a relatively small number of large files. The disk
> >>is a single-member RAID0 array, with 1MB chunk size. There are 32

Ok, now I need to know what "single member RAID0 array" means,
because this is clearly related to allocation alignment and I need
to know why the FS was configured the way it was.

Is it one disk? Or is it a hardware RAID0 array that presents as a
single lun with a stripe width of 1MB? If so, how many disks are in
it? Is the chunk size the stripe unit (per-disk chunk size) or the
stripe width (all disks get hit by a 1MB IO)?

Or something else? 

> >>AGs. Running Linux 4.9.17.
> >ENOSPC on what operation? write? open(O_CREAT)? something else?
> 
> 
> Unknown.
> 
> 
> >What's the filesystem config (xfs_info output)?
> 
> 
> (restored from metadata dump)
> 
> 
> meta-data=/dev/loop2		isize=512 agcount=32, agsize=14494720 blks
>          =                    sectsz=512 attr=2, projid32bit=1
>          =                    crc=1 finobt=0 spinodes=0 rmapbt=0
>          =                    reflink=0
> data     =                    bsize=4096 blocks=463831040, imaxpct=5
>          =                    sunit=256 swidth=256 blks

sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
and the array only reports one number to mkfs. Was this chosen by
mkfs, or specifically configured by the user? If specifically
configured, why?

What is important is that it means aligned allocations will be used
for any allocation that is over sunit (1MB) and that's where all the
problems seem to come from.

> naming   =version 2           bsize=4096 ascii-ci=0 ftype=1
> log      =internal            bsize=4096 blocks=226480, version=2
>          =                    sectsz=512 sunit=8 blks, lazy-count=1
> realtime =none                extsz=4096 blocks=0, rtextents=0
> 
> > Has xfs_fsr been run on this filesystem
> >regularly?
> 
> 
> xfs_fsr has never been run, until we saw the problem (and then did
> not fix it).  IIUC the workload should be self-defragmenting: it
> consists of writing large files, then erasing them. I estimate that
> around 100 files are written concurrently (from 14 threads), and
> they are written with large extent hints. With every large file,
> another smaller (but still large) file is written, and a few
> smallish metadata files.

Do those smaller files get removed when the big files are removed?

> I understood from xfs_fsr that it attempts to defragment files, not
> free space, although that may come as a side effect. In any case I
> ran xfs_db after xfs_fsr and did not see an improvement.

xfs_fsr takes fragmented files and contiguous free space and turns
it into contiguous files and fragmented free space. You have
fragmented free space, so I needed to know if xfs_fsr was
responsible for that....

> >If the ENOSPC errors are only from files with a 32MB extent size
> >hints on them, then it may be that there isn't sufficient contiguous
> >free space to allocate an entire 32MB extent. I'm not sure what the
> >allocator behaviour here is (the code is a maze of twisty passages),
> >so I'll have to look more into this.
> 
> There are other files with 32MB hints that do not show the error
> (but on the other hand, the error has been observed few enough times
> for that to be a fluke).

*nod*

> >In the mean time, can you post the output of the freespace command
> >(both global and per-ag) so we can see just how much free space
> >there is and how badly fragmented it has become? I might be able to
> >reproduce the behaviour if I know the conditions under which it is
> >occurring.
> 
> 
> xfs_db> freesp
>  from      to  extents    blocks    pct
>  1          1     5916      5916   0.00
>  2          3    10235     22678   0.01
>  4          7    12251     66829   0.02
>  8         15     5521     59556   0.01
>  16        31     5703    132031   0.03
>  32        63     9754    463825   0.11
>  64       127    16742   1590339   0.37
>  128      255  1550511 390108625  89.87
>  256      511    71516  29178504   6.72
>  512     1023       19     15355   0.00
>  1024    2047      287    461824   0.11
>  2048    4095      528   1611413   0.37
>  4096    8191     1537  10352304   2.38
>  8192   16383        2     19015   0.00
> 
> Just 2 extents >= 32MB (and they may have been freed after the error).

Yes, and the vast majority of free space is in lengths between 512kB
and 1020kB. This is what I'd expect if you have large, stripe
aligned allocations interleaved with smaller, sub-stripe unit
allocations.

As an example of behaviour that can lead to this sort of free space
fragmentation, start with 10 stripe units of contiguous free space:

  0    1    2    3    4    5    6    7    8    9    10
  +----+----+----+----+----+----+----+----+----+----+----+

Now allocate a > stripe unit extent (say 2 units):

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLL+----+----+----+----+----+----+----+----+----+

Now allocate a small file A:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---+----+----+----+----+----+----+----+----+

Now allocate another large extent:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---LLLLLLLLLL+----+----+----+----+----+----+

After a while, a significant part of your filesystem looks like
this repeating pattern:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---LLLLLLLLLLBB---LLLLLLLLLLCC---LLLLLLLLLLDD---+

i.e. there are lots of small, isolated sub stripe unit free spaces.
If you now start removing large extents but leaving the small
files behind, you end up with this:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---+---------BB---LLLLLLLLLLCC---+----+----DD---+

And now we go to allocate a new large+small file pair (M+n)
they'll get laid out like this:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---MMMMMMMMMMBB---LLLLLLLLLLCC---nn---+----DD---+

See how we lost a large aligned 2MB freespace @ 9 when the small
file "nn" was laid down? repeat this fill and free pattern over and
over again, and eventually it fragments the free space until there's
no large contiguous free spaces left, and large aligned extents can
no longer be allocated.

For this to trigger you need the small files to be larger than 1
stripe unit, but still much smaller than the extent size hint, and
the small files need to hang around as the large files come and go.
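
A toy reproducer of that pattern might look like this (a sketch only; 
sizes chosen to match the geometry above, paths hypothetical):

    # interleave large aligned files with small >sunit files,
    # then delete only the large ones
    for i in $(seq 1 1000); do
        xfs_io -f -c "extsize 32m" -c "falloc 0 32m" /mnt/test/big.$i
        xfs_io -f -c "falloc 0 2m" /mnt/test/small.$i  # >1MB sunit, <<hint
    done
    rm -f /mnt/test/big.*
    xfs_db -r -c "freesp -s" /dev/md0   # free space is now chopped up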

> >>Is this a known issue?

The effect and symptom are known: it's a generic large aligned extent
vs small unaligned extent issue, but I've never seen it manifest in a
user workload outside of
a very constrained multistream realtime video ingest/playout
workload (i.e. the workload the filestreams allocator was written
for). And before you ask, no, the filestreams allocator does not
solve this problem.

The most common manifestation of this problem has been inode
allocation on filesystems full of small files - inodes are allocated
in large aligned extents compared to small files, and so eventually
the filesystem runs out of large contiguous freespace and inodes
can't be allocated. The sparse inodes mkfs option fixed this by
allowing inodes to be allocated as sparse chunks so they could
interleave into any free space available....
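
For reference, that option is set at mkfs time and needs a v5 (CRC) 
filesystem and a 4.2+ kernel:

    mkfs.xfs -i sparse=1 /dev/md0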

> >>Would upgrading the kernel help?
> >Not that I know of. If it's an extszhint vs free space fragmentation
> >issue, then a kernel upgrade is unlikely to fix it.

Upgrading the kernel won't fix it, because it's an extszhint vs free
space fragmentation issue.

Filesystems that get into this state are generally considered
unrecoverable.  Well, you can recover them by deleting everything
from them to reform contiguous free space, but you may as well just
mkfs and restore from backup because it's much, much faster than
waiting for rm -rf....

And, really, I expect that a different filesystem geometry and/or
mount options are going to be needed to avoid getting into this
state again. However, I don't yet know enough about what in the
workload and allocator is triggering the issue to say for certain.

Can I get access to the metadump to dig around in the filesystem
directly so I can see how everything has ended up laid out? that
will help me work out what is actually occurring and determine if
mkfs/mount options can address the problem or whether deeper
allocator algorithm changes may be necessary....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: ENOSPC on a 10% used disk
  2018-10-18 10:05     ` Dave Chinner
@ 2018-10-18 11:00       ` Avi Kivity
  2018-10-18 13:36         ` Avi Kivity
                           ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 11:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 18/10/2018 13.05, Dave Chinner wrote:
> [ hmmm, there's some whacky utf-8 whitespace characters in the
>   copy-n-pasted text... ]


It's a brave new world out there.


> On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
>> On 18/10/2018 04.37, Dave Chinner wrote:
>>> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>>>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>>>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>>>> inode64 and has a relatively small number of large files. The disk
>>>> is a single-member RAID0 array, with 1MB chunk size. There are 32
> Ok, now I need to know what "single member RAID0 array" means,
> because this is clearly related to allocation alignment and I need
> to know why the FS was configured the way it was.


It's a Linux RAID device, /dev/md0.


We configure it this way so that it's easy to add storage (okay, the 
real reason is probably to avoid special casing one drive).
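
Roughly like this, assuming mdadm (device names hypothetical; IIRC 
mdadm wants --force for a single-member RAID0):

    mdadm --create /dev/md0 --level=0 --raid-devices=1 --force \
          --chunk=1024 /dev/nvme0n1   # 1024KiB chunk = the 1MB above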


>
> Is it one disk? Or is it a hardware RAID0 array that presents as a
> single lun with a stripe width of 1MB? If so, how many disks are in
> it? Is the chunk size the stripe unit (per-disk chunk size) or the
> stripe width (all disks get hit by a 1MB IO)?
>
> Or something else?


One disk, organized into a Linux RAID device with just one member.


>
>>>> AGs. Running Linux 4.9.17.
>>> ENOSPC on what operation? write? open(O_CREAT)? something else?
>>
>> Unknown.
>>
>>
>>> What's the filesystem config (xfs_info output)?
>>
>> (restored from metadata dump)
>>
>>
>> meta-data=/dev/loop2		isize=512 agcount=32, agsize=14494720 blks
>>           =                    sectsz=512 attr=2, projid32bit=1
>>           =                    crc=1 finobt=0 spinodes=0 rmapbt=0
>>           =                    reflink=0
>> data     =                    bsize=4096 blocks=463831040, imaxpct=5
>>           =                    sunit=256 swidth=256 blks
> sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
> and the array only reports one number to mkfs. Was this chosen by
> mkfs, or specifically configured by the user? If specifically
> configured, why?


I'm guessing it's because it has one member? I'm guessing the usual is 
swidth=sunit*nmembers?


Maybe that configuration confused xfs? Although we've been using it on 
many instances.


>
> What is important is that it means aligned allocations will be used
> for any allocation that is over sunit (1MB) and that's where all the
> problems seem to come from.


Do these aligned allocations not fall back to non-aligned allocations if 
they fail?


>
>> naming   =version 2           bsize=4096 ascii-ci=0 ftype=1
>> log      =internal            bsize=4096 blocks=226480, version=2
>>           =                    sectsz=512 sunit=8 blks, lazy-count=1
>> realtime =none                extsz=4096 blocks=0, rtextents=0
>>
>>> Has xfs_fsr been run on this filesystem
>>> regularly?
>>
>> xfs_fsr has never been run, until we saw the problem (and then did
>> not fix it).  IIUC the workload should be self-defragmenting: it
>> consists of writing large files, then erasing them. I estimate that
>> around 100 files are written concurrently (from 14 threads), and
>> they are written with large extent hints. With every large file,
>> another smaller (but still large) file is written, and a few
>> smallish metadata files.
> Do those smaller files get removed when the big files are removed?


Yes. It's more or less like this:


1. Create two big files, with 32MB hints

2. Append to the two files, using 128k AIO/DIO writes. We truncate ahead 
so those writes are not size-changing.

3. Truncate those files to their final size, write ~5 much smaller files 
using the same pattern

4. A bunch of fdatasyncs, renames, and directory fdatasyncs

5. The two big files get random reads for a random while

6. All files are unlinked (with some rename and directory fdatasyncs so 
we can recover if we crash while doing that)

7. Rinse, repeat. The whole thing happens in parallel for similar and 
different filesizes and lifetimes.


The commitlog files (for which we've seen the error) are simpler: create 
a file with 32MB extent hint, truncate to 32MB size, lots of writes 
(which may not all be 128k).
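
In xfs_io terms the commitlog life cycle is roughly (path hypothetical, 
and the AIO writes simplified to synchronous DIO):

    xfs_io -f -d -c "extsize 32m" -c "truncate 32m" \
           -c "pwrite -b 128k 0 32m" CommitLog-new
    # ...then renames/fdatasyncs, a period of reads, and finally unlink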


>
>> I understood from xfs_fsr that it attempts to defragment files, not
>> free space, although that may come as a side effect. In any case I
>> ran xfs_db after xfs_fsr and did not see an improvement.
> xfs_fsr takes fragmented files and contiguous free space and turns
> it into contiguous files and fragmented free space. You have
> fragmented free space, so I needed to know if xfs_fsr was
> responsible for that....


I see.


>
>>> If the ENOSPC errors are only from files with a 32MB extent size
>>> hints on them, then it may be that there isn't sufficient contiguous
>>> free space to allocate an entire 32MB extent. I'm not sure what the
>>> allocator behaviour here is (the code is a maze of twisty passages),
>>> so I'll have to look more into this.
>> There are other files with 32MB hints that do not show the error
>> (but on the other hand, the error has been observed few enough times
>> for that to be a fluke).
> *nod*
>
>>> In the mean time, can you post the output of the freespace command
>>> (both global and per-ag) so we can see just how much free space
>>> there is and how badly fragmented it has become? I might be able to
>>> reproduce the behaviour if I know the conditions under which it is
>>> occurring.
>>
>> xfs_db> freesp
>>   from      to  extents    blocks    pct
>>   1          1     5916      5916   0.00
>>   2          3    10235     22678   0.01
>>   4          7    12251     66829   0.02
>>   8         15     5521     59556   0.01
>>   16        31     5703    132031   0.03
>>   32        63     9754    463825   0.11
>>   64       127    16742   1590339   0.37
>>   128      255  1550511 390108625  89.87
>>   256      511    71516  29178504   6.72
>>   512     1023       19     15355   0.00
>>   1024    2047      287    461824   0.11
>>   2048    4095      528   1611413   0.37
>>   4096    8191     1537  10352304   2.38
>>   8192   16383        2     19015   0.00
>>
>> Just 2 extents >= 32MB (and they may have been freed after the error).
> Yes, and the vast majority of free space is in lengths between 512kB
> and 1020kB. This is what I'd expect if you have large, stripe
> aligned allocations interleaved with smaller, sub-stripe unit
> allocations.
>
> As an example of behaviour that can lead to this sort of free space
> fragmentation, start with 10 stripe units of contiguous free space:
>
>    0    1    2    3    4    5    6    7    8    9    10
>    +----+----+----+----+----+----+----+----+----+----+----+
>
> Now allocate a > stripe unit extent (say 2 units):
>
>    0    1    2    3    4    5    6    7    8    9    10
>    LLLLLLLLLL+----+----+----+----+----+----+----+----+----+
>
> Now allocate a small file A:
>
>    0    1    2    3    4    5    6    7    8    9    10
>    LLLLLLLLLLAA---+----+----+----+----+----+----+----+----+
>
> Now allocate another large extent:
>
>    0    1    2    3    4    5    6    7    8    9    10
>    LLLLLLLLLLAA---LLLLLLLLLL+----+----+----+----+----+----+
>
> After a while, a significant part of your filesystem looks like
> this repeating pattern:
>
>    0    1    2    3    4    5    6    7    8    9    10
>    LLLLLLLLLLAA---LLLLLLLLLLBB---LLLLLLLLLLCC---LLLLLLLLLLDD---+
>
> i.e. there are lots of small, isolated sub stripe unit free spaces.
> If you now start removing large extents but leaving the small
> files behind, you end up with this:
>
>    0    1    2    3    4    5    6    7    8    9    10
>    LLLLLLLLLLAA---+---------BB---LLLLLLLLLLCC---+----+----DD---+
>
> And now we go to allocate a new large+small file pair (M+n)
> they'll get laid out like this:
>
>    0    1    2    3    4    5    6    7    8    9    10
>    LLLLLLLLLLAA---MMMMMMMMMMBB---LLLLLLLLLLCC---nn---+----DD---+
>
> See how we lost a large aligned 2MB freespace @ 9 when the small
> file "nn" was laid down? repeat this fill and free pattern over and
> over again, and eventually it fragments the free space until there's
> no large contiguous free spaces left, and large aligned extents can
> no longer be allocated.
>
> For this to trigger you need the small files to be larger than 1
> stripe unit, but still much smaller than the extent size hint, and
> the small files need to hang around as the large files come and go.


This can happen, and indeed I see our default hint is 1MB, so our small 
files use a 1MB hint. Looks like we should remove that 1MB hint since 
it's reducing allocation flexibility for XFS without a good return. On 
the other hand, I worry that because we bypass the page cache, XFS 
doesn't get to see the entire file at one time and so it will get 
fragmented.


Suppose I write a 4k file with a 1MB hint. How is that trailing (1MB-4k) 
marked? Free extent, free extent with extra annotation, or allocated 
extent? We may need to deallocate those extents? (will 
FALLOC_FL_PUNCH_HOLE do the trick?)
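
As it turns out below, the speculative preallocation is freed on close, 
so no punching is needed. For reference, punching such a tail would look 
like this (offsets hypothetical for a 4k file with a 1MB hint):

    xfs_io -c "fpunch 4k 1020k" smallfile   # free the (1MB-4k) tail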




>
>>>> Is this a known issue?
> The effect and symptom are known: it's a generic large aligned extent
> vs small unaligned extent issue, but I've never seen it manifest in a
> user workload outside of
> a very constrained multistream realtime video ingest/playout
> workload (i.e. the workload the filestreams allocator was written
> for). And before you ask, no, the filestreams allocator does not
> solve this problem.
>
> The most common manifestation of this problem has been inode
> allocation on filesystems full of small files - inodes are allocated
> in large aligned extents compared to small files, and so eventually
> the filesystem runs out of large contiguous freespace and inodes
> can't be allocated. The sparse inodes mkfs option fixed this by
> allowing inodes to be allocated as sparse chunks so they could
> interleave into any free space available....


Shouldn't XFS fall back to a non-aligned allocation rather than 
returning ENOSPC on a filesystem with 90% free space?


>
>>>> Would upgrading the kernel help?
>>> Not that I know of. If it's an extszhint vs free space fragmentation
>>> issue, then a kernel upgrade is unlikely to fix it.
> Upgrading the kernel won't fix it, because it's an extszhint vs free
> space fragmentation issue.
>
> Filesystems that get into this state are generally considered
> unrecoverable.  Well, you can recover them by deleting everything
> from them to reform contiguous free space, but you may as well just
> mkfs and restore from backup because it's much, much faster than
> waiting for rm -rf....
>
> And, really, I expect that a different filesystem geometry and/or
> mount options are going to be needed to avoid getting into this
> state again. However, I don't yet know enough about what in the
> workload and allocator is triggering the issue to say for certain.
>
> Can I get access to the metadump to dig around in the filesystem
> directly so I can see how everything has ended up laid out? that
> will help me work out what is actually occurring and determine if
> mkfs/mount options can address the problem or whether deeper
> allocator algorithm changes may be necessary....


I will ask permission to share the dump.


Thanks a lot for all the explanations and help.


* Re: ENOSPC on a 10% used disk
  2018-10-18 11:00       ` Avi Kivity
@ 2018-10-18 13:36         ` Avi Kivity
  2018-10-19  7:51           ` Dave Chinner
  2018-10-18 15:44         ` Avi Kivity
  2018-10-19  1:15         ` Dave Chinner
  2 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 13:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 18/10/2018 14.00, Avi Kivity wrote:
>
>> Can I get access to the metadump to dig around in the filesystem
>> directly so I can see how everything has ended up laid out? that
>> will help me work out what is actually occurring and determine if
>> mkfs/mount options can address the problem or whether deeper
>> allocator algorithm changes may be necessary....
>
>
> I will ask permission to share the dump.
>
>
>

I'll send you a link privately.


* Re: ENOSPC on a 10% used disk
  2018-10-18 11:00       ` Avi Kivity
  2018-10-18 13:36         ` Avi Kivity
@ 2018-10-18 15:44         ` Avi Kivity
  2018-10-18 16:11           ` Avi Kivity
  2018-10-19  1:24           ` Dave Chinner
  2018-10-19  1:15         ` Dave Chinner
  2 siblings, 2 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 15:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 18/10/2018 14.00, Avi Kivity wrote:
>
>
> This can happen, and indeed I see our default hint is 1MB, so our 
> small files use a 1MB hint. Looks like we should remove that 1MB hint 
> since it's reducing allocation flexibility for XFS without a good return.


I convinced myself that this is the root cause; it fits perfectly with 
your explanation. I still think that XFS should allocate *something* 
rather than ENOSPC, but I can also understand someone wanting a guarantee.


> On the other hand, I worry that because we bypass the page cache, XFS 
> doesn't get to see the entire file at one time and so it will get 
> fragmented.


That's what happens. I wrote 1000 4k writes to 400 files, in parallel, 
AIO+DIO, and got 400 perfectly-fragmented files, each with 1000 extents.
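
The experiment can be approximated with fio; the parameters mirror the 
description above, everything else is an assumption:

    fio --name=frag --directory=/mnt/test --numjobs=400 --size=4000k \
        --bs=4k --rw=write --ioengine=libaio --iodepth=16 --direct=1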


So I'll remove the default hint for small files, and replace it with 
larger buffer sizes so we batch more and don't get 8k-sized extents 
(which is our default buffer size).


>
>
> Suppose I write a 4k file with a 1MB hint. How is that trailing 
> (1MB-4k) marked? Free extent, free extent with extra annotation, or 
> allocated extent? We may need to deallocate those extents? (will 
> FALLOC_FL_PUNCH_HOLE do the trick?)
>

I found an 11-year-old post from you that says those reservations are 
freed on close:


https://linux-xfs.oss.sgi.narkive.com/Bpctu4DN/reducing-memory-requirements-for-high-extent-xfs-files#post6


This is consistent with xfs_db reporting those areas are free.


* Re: ENOSPC on a 10% used disk
  2018-10-17  7:52 ENOSPC on a 10% used disk Avi Kivity
  2018-10-17  8:47 ` Christoph Hellwig
  2018-10-18  1:37 ` Dave Chinner
@ 2018-10-18 15:54 ` Eric Sandeen
  2018-10-21 11:49   ` Avi Kivity
  2019-02-05 21:48 ` Dave Chinner
  3 siblings, 1 reply; 26+ messages in thread
From: Eric Sandeen @ 2018-10-18 15:54 UTC (permalink / raw)
  To: Avi Kivity, linux-xfs

On 10/17/18 2:52 AM, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df), getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a relatively small number of large files. The disk is a single-member RAID0 array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.
> 
> 
> The write load consists of AIO/DIO writes, followed by unlinks of these files. The writes are non-size-changing (we truncate ahead) and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The errors happen on commit logs, which have a target size of 32MB (but may exceed it a little).
> 
> 
> The errors are sporadic and after restarting the workload they go away for a few hours to a few days, but then return. During one of the crashes I used xfs_db to look at fragmentation and saw that most AGs had free extents of size categories up to 128-255, but a few had more. I tried xfs_fsr but it did not help.
> 
> 
> Is this a known issue? Would upgrading the kernel help?
> 
> 
> I'll try to get a metadata dump next time this happens, and I'll be happy to supply more information.

It sounds like you all figured this out, but I'll drop a reference to
One Weird Trick to figure out just what function is returning a specific
error value (the example below is EINVAL)

First is my hack; what follows is Dave's refinement.  We should get this
into scripts/ some day.

> # for FUNCTION in `grep "t xfs_" /proc/kallsyms | awk '{print $3}'`; do echo "r:ret_$FUNCTION $FUNCTION \$retval" >> /sys/kernel/debug/tracing/kprobe_events; done
> 
> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 1 > $ENABLE; done
> 
> run a test that fails:
> 
> # dd if=/dev/zero of=newfile bs=513 oflag=direct
> dd: writing `newfile': Invalid argument
> 
> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 0 > $ENABLE; done
> 
> # cat /sys/kernel/debug/tracing/trace
> <snip>
>            <...>-63791 [000] d... 705435.568913: ret_xfs_vn_mknod: (xfs_vn_create+0x13/0x20 [xfs] <- xfs_vn_mknod) arg1=0
>            <...>-63791 [000] d... 705435.568913: ret_xfs_vn_create: (vfs_create+0xdb/0x100 <- xfs_vn_create) arg1=0
>            <...>-63791 [000] d... 705435.568918: ret_xfs_file_open: (do_dentry_open+0x24e/0x2e0 <- xfs_file_open) arg1=0
>            <...>-63791 [000] d... 705435.568934: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x147/0x150 [xfs] <- xfs_file_dio_aio_write) arg1=ffffffffffffffea
> 
> Hey look, it's "-22" in hex!  
> 
> so it's possible, but bleah.

Dave later refined that to:

> #!/bin/bash
> 
> TRACEDIR=/sys/kernel/debug/tracing
> 
> grep -i 't xfs_' /proc/kallsyms | awk '{print $3}' | while read F; do
> 	echo "r:ret_$F $F \$retval" >> $TRACEDIR/kprobe_events
> done
> 
> for E in $TRACEDIR/events/kprobes/ret_xfs_*/enable; do
> 	echo 1 > $E
> done;
> 
> echo 'arg1 > 0xffffffffffffff00' > $TRACEDIR/events/kprobes/filter
> 
> for T in $TRACEDIR/events/kprobes/ret_xfs_*/trigger; do
> 	echo 'traceoff if arg1 > 0xffffffffffffff00' > $T
> done



> And that gives:
> 
> # dd if=/dev/zero of=/mnt/scratch/newfile bs=513 oflag=direct
> dd: error writing '/mnt/scratch/newfile': Invalid argument
> 1+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0.000259882 s, 0.0 kB/s
> root@test4:~# cat /sys/kernel/debug/tracing/trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 1/1   #P:16
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> #              | |       |   ||||       |         |
>            <...>-8073  [006] d... 145740.460546: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x170/0x180 <- xfs_file_dio_aio_write) arg1=0xffffffffffffffea
> 
> Which is precisely the detection that XFS_ERROR would have given us.
> Ok, so I guess we can now add whatever we need to that trigger...
> 
> Basically, pass in the XFS function names you want to trace, the
> script sets up the events and whatever trigger behaviour you want, and
> we're off to the races...


* Re: ENOSPC on a 10% used disk
  2018-10-18 15:44         ` Avi Kivity
@ 2018-10-18 16:11           ` Avi Kivity
  2018-10-19  1:24           ` Dave Chinner
  1 sibling, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 16:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 18/10/2018 18.44, Avi Kivity wrote:
>
> On 18/10/2018 14.00, Avi Kivity wrote:
>>
>>
>> This can happen, and indeed I see our default hint is 1MB, so our 
>> small files use a 1MB hint. Looks like we should remove that 1MB hint 
>> since it's reducing allocation flexibility for XFS without a good 
>> return.
>
>
> I convinced myself that this is the root cause, it fits perfectly with 
> your explanation. I still think that XFS should allocate *something* 
> rather than ENOSPC, but I can also understand someone wanting a 
> guarantee.
>

A small twist: there were in fact lots of small files on that system, 
caused by snapshots that the user did not remove. But I think the 
explanation still holds.


* Re: ENOSPC on a 10% used disk
  2018-10-18 11:00       ` Avi Kivity
  2018-10-18 13:36         ` Avi Kivity
  2018-10-18 15:44         ` Avi Kivity
@ 2018-10-19  1:15         ` Dave Chinner
  2018-10-21  9:21           ` Avi Kivity
  2 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-19  1:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote:
> On 18/10/2018 13.05, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
> >>On 18/10/2018 04.37, Dave Chinner wrote:
> >>>On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> >>>>I have a user running a 1.7TB filesystem with ~10% usage (as shown
> >>>>by df), getting sporadic ENOSPC errors. The disk is mounted with
> >>>>inode64 and has a relatively small number of large files. The disk
> >>>>is a single-member RAID0 array, with 1MB chunk size. There are 32
> >Ok, now I need to know what "single member RAID0 array" means,
> >becuase this is clearly related to allocation alignment and I need
> >to know why the FS was configured the way it was.
> 
> 
> It's a Linux RAID device, /dev/md0.
> 
> 
> We configure it this way so that it's easy to add storage (okay, the
> real reason is probably to avoid special casing one drive).

As a stripe? That requires resilvering to expand, which is a slow,
messy operation. There have also been too many horror stories about
crashes during resilvering causing unrecoverable corruptions for my
liking...

> One disk, organized into a Linux RAID device with just one member.

So there's no real need for IO alignment at all. Unaligned writes
to RAID0 don't require RMW cycles, so alignment is really only used
to avoid hotspotting a disk in the stripe. Which isn't an issue
here, either.

> >>meta-data=/dev/loop2		isize=512 agcount=32, agsize=14494720 blks
> >>          =                    sectsz=512 attr=2, projid32bit=1
> >>          =                    crc=1 finobt=0 spinodes=0 rmapbt=0
> >>          =                    reflink=0
> >>data     =                    bsize=4096 blocks=463831040, imaxpct=5
> >>          =                    sunit=256 swidth=256 blks
> >sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
> >and the array only reports one number to mkfs. Was this chosen by
> >mkfs, or specifically configured by the user? If specifically
> >configured, why?
> 
> 
> I'm guessing it's because it has one member? I'm guessing the usual
> is swidth=sunit*nmembers?

*nod*. Which is unusual for a RAID0 device.

> >What is important is that it means aligned allocations will be used
> >for any allocation that is over sunit (1MB) and that's where all the
> >problems seem to come from.
> 
> Do these aligned allocations not fall back to non-aligned
> allocations if they fail?

They do, but extent size hints change the fallback behaviour...

> >See how we lost a large aligned 2MB freespace @ 9 when the small
> >file "nn" was laid down? repeat this fill and free pattern over and
> >over again, and eventually it fragments the free space until there's
> >no large contiguous free spaces left, and large aligned extents can
> >no longer be allocated.
> >
> >For this to trigger you need the small files to be larger than 1
> >stripe unit, but still much smaller than the extent size hint, and
> >the small files need to hang around as the large files come and go.
> 
> 
> This can happen, and indeed I see our default hint is 1MB, so our
> small files use a 1MB hint.

Ok, which forces all allocations to be at least stripe unit (1MB)
aligned. 

>
> Looks like we should remove that 1MB
> hint since it's reducing allocation flexibility for XFS without a
> good return. On the other hand, I worry that because we bypass the
> page cache, XFS doesn't get to see the entire file at one time and
> so it will get fragmented.

Yes. Your other option is to use an extent size hint that is smaller
than the sunit. That should not align to 1MB because the initial
data allocation size is not large enough to trigger stripe
alignment.
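
A minimal sketch of setting such a sub-sunit hint from userspace via
the fsxattr ioctls in <linux/fs.h> (the 512KB value is illustrative,
error handling is elided, and the hint must be set while the file is
still empty):

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Set a 512KB extent size hint, below the 1MB sunit, so small file
 * allocations are no longer large enough to trigger stripe alignment.
 */
static int set_small_extsize(int fd)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;	/* the 0x800 xflag */
	fsx.fsx_extsize = 512 * 1024;		/* hint in bytes */
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}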

> Suppose I write a 4k file with a 1MB hint. How is that trailing
> (1MB-4k) marked? Free extent, free extent with extra annotation, or
> allocated extent? We may need to deallocate those extents? (will
> FALLOC_FL_PUNCH_HOLE do the trick?)

It's an unwritten extent beyond EOF, and how that is treated when
the file is last closed depends on how that extent was allocated.
But, yes, punching the range beyond EOF will definitely free it.
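
A minimal sketch of that punch via fallocate(2) (PUNCH_HOLE must be
paired with KEEP_SIZE; the 1MB hint size is assumed here rather than
queried, and error handling is elided):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/* Free the preallocated space between EOF and the assumed 1MB extent
 * size hint, returning it to the free space pool.
 */
static int punch_tail(int fd)
{
	const off_t hint = 1024 * 1024;
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	if (st.st_size >= hint)
		return 0;	/* nothing beyond the hint to punch */
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 st.st_size, hint - st.st_size);
}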

> >>>>Is this a known issue?
> >The effect and symptom: it's a generic large aligned extent vs small unaligned extent
> >issue, but I've never seen it manifest in a user workload outside of
> >a very constrained multistream realtime video ingest/playout
> >workload (i.e. the workload the filestreams allocator was written
> >for). And before you ask, no, the filestreams allocator does not
> >solve this problem.
> >
> >The most common manifestation of this problem has been inode
> >allocation on filesystems full of small files - inodes are allocated
> >in large aligned extents compared to small files, and so eventually
> >the filesystem runs out of large contiguous freespace and inodes
> >can't be allocated. The sparse inodes mkfs option fixed this by
> >allowing inodes to be allocated as sparse chunks so they could
> >interleave into any free space available....
> 
> Shouldn't XFS fall back to a non-aligned allocation rather that
> returning ENOSPC on a filesystem with 90% free space?

The filesystem does fall back to unaligned allocation - there's ~5
separate, progressively less strict allocation attempts on failure.

The problem is that the extent size hint is asking to allocate a
contiguous 32MB extent and there's no contiguous 32MB free space
extent available, aligned or not.  That's what I think is generating
the ENOSPC error, but it's not clear to me from the code whether it
is supposed to ignore the extent size hint on failure and allocate a
set of shorter unaligned extents or not....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-18 15:44         ` Avi Kivity
  2018-10-18 16:11           ` Avi Kivity
@ 2018-10-19  1:24           ` Dave Chinner
  2018-10-21  9:00             ` Avi Kivity
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-19  1:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote:
> 
> On 18/10/2018 14.00, Avi Kivity wrote:
> >
> >
> >This can happen, and indeed I see our default hint is 1MB, so our
> >small files use a 1MB hint. Looks like we should remove that 1MB
> >hint since it's reducing allocation flexibility for XFS without a
> >good return.
> 
> 
> I convinced myself that this is the root cause, it fits perfectly
> with your explanation. I still think that XFS should allocate
> *something* rather than ENOSPC, but I can also understand someone
> wanting a guarantee.

Yup, it's a classic catch 22.

> >On the other hand, I worry that because we bypass the page cache,
> >XFS doesn't get to see the entire file at one time and so it will
> >get fragmented.
> 
> 
> That's what happens. I write 1000 4k writes to 400 files, in
> parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had
> 1000 extents.

Yup, you wrote them all in the one directory, didn't you? :)

> So I'll remove the default hint for small files, and replace it with
> larger buffer sizes so we batch more and don't get 8k-sized extents
> (which is our default buffer size).

Or you could just mount with the "noalign" mount option to turn off
stripe alignment. After all, you don't need stripe alignment for a
single spindle....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-18 13:36         ` Avi Kivity
@ 2018-10-19  7:51           ` Dave Chinner
  2018-10-21  8:55             ` Avi Kivity
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-19  7:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
> On 18/10/2018 14.00, Avi Kivity wrote:
> >>Can I get access to the metadump to dig around in the filesystem
> >>directly so I can see how everything has ended up laid out? that
> >>will help me work out what is actually occurring and determine if
> >>mkfs/mount options can address the problem or whether deeper
> >>allocator algorithm changes may be necessary....
> >
> >I will ask permission to share the dump.
> 
> I'll send you a link privately.

Thanks - I've started looking at this - the information here is
just layout stuff - I've omitted filenames and anything else that
might be identifying from the output.

Looking at a commit log file:

stat.size = 33554432
stat.blocks = 34720
fsxattr.xflags = 0x800 [----------e-----]
fsxattr.projid = 0
fsxattr.extsize = 33554432
fsxattr.cowextsize = 0
fsxattr.nextents = 14


and the layout:

EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
  0: [0..4079]:       2646677520..2646681599 22 (95606800..95610879)  4080 001010
  1: [4080..8159]:    2643130384..2643134463 22 (92059664..92063743)  4080 001010
  2: [8160..12239]:   2642124816..2642128895 22 (91054096..91058175)  4080 001010
  3: [12240..16319]:  2640666640..2640670719 22 (89595920..89599999)  4080 001010
  4: [16320..18367]:  2640523264..2640525311 22 (89452544..89454591)  2048 000000
  5: [18368..20415]:  2640119808..2640121855 22 (89049088..89051135)  2048 000000
  6: [20416..21287]:  2639874064..2639874935 22 (88803344..88804215)   872 001111
  7: [21288..21295]:  2639874936..2639874943 22 (88804216..88804223)     8 011111
  8: [21296..24495]:  2639874944..2639878143 22 (88804224..88807423)  3200 001010
  9: [24496..26543]:  2639427584..2639429631 22 (88356864..88358911)  2048 000000
 10: [26544..28591]:  2638981120..2638983167 22 (87910400..87912447)  2048 000000
 11: [28592..30639]:  2638770176..2638772223 22 (87699456..87701503)  2048 000000
 12: [30640..31279]:  2638247952..2638248591 22 (87177232..87177871)   640 001111
 13: [31280..34719]:  2638248592..2638252031 22 (87177872..87181311)  3440 011010
 14: [34720..65535]:  hole                                           30816

The first thing I note is the initial allocations are just short of
2MB and so the extent size hint is, indeed, being truncated here
according to contiguous free space limitations. I had thought that
should occur from reading the code, but it's complex and I wasn't
100% certain what minimum allocation length would be used.

Looking at the system batchlog files, I'm guessing the filesystem
ran out of contiguous 32MB free space extents some time around
September 25. The *Data.db files from 24 Sep and earlier are all
nice 32MB extents; from 25 Sep onwards they never make the full
32MB (30-31MB max). eg, good:

 EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
   0: [0..65535]:       350524552..350590087  3 (2651272..2716807)   65536 001111
   1: [65536..131071]:  353378024..353443559  3 (5504744..5570279)   65536 001111
   2: [131072..196607]: 355147016..355212551  3 (7273736..7339271)   65536 001111
   3: [196608..262143]: 360029416..360094951  3 (12156136..12221671) 65536 001111
   4: [262144..327679]: 362244144..362309679  3 (14370864..14436399) 65536 001111
   5: [327680..343415]: 365809456..365825191  3 (17936176..17951911) 15736 001111

bad:

EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
  0: [0..64127]:       512855496..512919623  4 (49024456..49088583) 64128 001111
  1: [64128..128247]:  266567048..266631167  2 (34651528..34715647) 64120 001010
  2: [128248..142327]: 264401888..264415967  2 (32486368..32500447) 14080 001111
 
Hmmm - there's 2 million files in this filesystem. That is quite a
lot...

Ok... I see where all the files are - there's a db that was
snapshotted every half hour going back to December 19 2017. There's
55GB of snapshot data there: 14362 snapshots holding 1.8 million
files.

Ok, now I understand how the filesystem got into this mess. It has
nothing really to do with the filesystem allocator, geometry, extent
size hints, etc. It isn't really even an XFS specific problem - I
think most filesystems would be in trouble if you did this to them.

First, let me demonstrate that the freespace fragmentation is caused
by these snapshots by removing them all:

before:
   from      to extents  blocks    pct
      1       1    5916    5916   0.00
      2       3   10235   22678   0.01
      4       7   12251   66829   0.02
      8      15    5521   59556   0.01
     16      31    5703  132031   0.03
     32      63    9754  463825   0.11
     64     127   16742 1590339   0.37
    128     255 1550511 390108625  89.87
    256     511   71516 29178504   6.72
    512    1023      19   15355   0.00
   1024    2047     287  461824   0.11
   2048    4095     528 1611413   0.37
   4096    8191    1537 10352304   2.38
   8192   16383       2   19015   0.00

Run a delete:

for d in snapshots/*; do
	rm -rf $d &
done

<cranking along at ~12,000 write iops>

# uptime
17:41:08 up 22:07,  1 user,  load average: 14293.17, 13840.37, 9517.14
#

500,000 files removed:
   from      to extents  blocks    pct
     64     127   22564 2054234   0.47
    128     255  900480 226428059  51.43
    256     511  189904 91033237  20.68
    512    1023   68304 54958788  12.48
   1024    2047   25187 38284024   8.70
   2048    4095    5508 15204528   3.45
   4096    8191    1665 10999789   2.50
   8192   16383      15  139424   0.03

1m files removed:
  from      to extents  blocks    pct
     64     127   21940 1991685   0.45
    128     255  536985 134731402  30.35
    256     511  152092 73465972  16.55
    512    1023  100471 82971130  18.69
   1024    2047   48519 74016490  16.67
   2048    4095   17272 49209538  11.09
   4096    8191    4307 25135374   5.66
   8192   16383     135 1254037   0.28

1.5m files removed:
  from      to extents  blocks    pct
     64     127    9851  924782   0.20
    128     255  227945 57079302  12.32
    256     511   38723 18129086   3.91
    512    1023   33547 28027554   6.05
   1024    2047   31904 50171699  10.83
   2048    4095   25263 75381887  16.27
   4096    8191   16885 102836365  22.19
   8192   16383    6367 68809645  14.85
  16384   32767    1862 40183775   8.67
  32768   65535     385 16228869   3.50
  65536  131071      51 4213237   0.91
 131072  262143       6  958528   0.21

after:
  from      to extents  blocks    pct
    128     255  154063 38785829   8.64
    256     511   11037 4942114   1.10
    512    1023    8576 6930035   1.54
   1024    2047    8496 13464298   3.00
   2048    4095    7664 23034455   5.13
   4096    8191    8497 55217061  12.31
   8192   16383    4233 45867691  10.22
  16384   32767    1533 33488995   7.46
  32768   65535     520 23924895   5.33
  65536  131071     305 28675646   6.39
 131072  262143     230 42411732   9.45
 262144  524287      98 37213190   8.29
 524288 1048575      41 29163579   6.50
1048576 2097151      27 40502889   9.03
2097152 4194303       5 14576157   3.25
4194304 8388607       2 10005670   2.23

Ok, so the result is not perfect, but there are now huge contiguous
free space extents available again - ~70% of the free space is now
contiguous extents >=32MB in length. There's every chance that the
fs would continue to help reform large contiguous free spaces as the
database files come and go now, as long as the snapshot problem is
dealt with. 

So, what's the problem? Well, it's simply that the workload is
mixing data with vastly different temporal characteristics in the
same physical locality. Every half an hour, a set of ~100 smallish
files are written into a new directory which lands them at the low
end of the largest free space extent in that AG. Each new snapshot
directory ends up in a different AG, so it slowly spreads the
snapshots across all the AGs in the filesystem.

Each snapshot effectively appends to the current working area in the
AG, chopping it out of the largest contiguous free space. By the
time the next snapshot in that AG comes around, there's other new
short term data between the old snapshot and the new one. The new
snapshot chops up the largest freespace, and on goes the cycle.

Eventually the short term data between the snapshots gets removed,
but this doesn't reform large contiguous free spaces because the
snapshot data is in the way. And so this cycle continues with the
snapshot data chopping up the largest freespace extents in the
filesystem until there are no more large free space extents to be
found.

The solution is to manage the snapshot data better. We need to keep
all the long term data physically isolated from the short term data
so they don't fragment free space. A short term application level
solution would require migrating the snapshot data out of the
filesystem to somewhere else and pointing to it with symlinks.

From the filesystem POV, I'm not sure that there is much we can do
about this directly - we have no idea what the lifetime of the data
is going to be....

<ding>

Hold on....

<rummage in code>

....we already have an interface for setting those sorts of hints.

fcntl(F_SET_RW_HINT, rw_hint)

/*
 * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
 * used to clear any hints previously set.
 */
#define RWF_WRITE_LIFE_NOT_SET  0
#define RWH_WRITE_LIFE_NONE     1
#define RWH_WRITE_LIFE_SHORT    2
#define RWH_WRITE_LIFE_MEDIUM   3
#define RWH_WRITE_LIFE_LONG     4
#define RWH_WRITE_LIFE_EXTREME  5
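
A minimal sketch of using it (glibc 2.27 and later expose
F_SET_RW_HINT; the hint is passed by pointer as a uint64_t):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

/* Tag an fd as holding short-lived data, e.g. a commit log that will
 * be unlinked shortly after it is written.
 */
static int tag_short_lived(int fd)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;

	return fcntl(fd, F_SET_RW_HINT, &hint);
}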

Avi, does this sound like something that you could use to
classify the different types of data the database writes out?

I'll need to have a think about how to apply this to the allocator
policy algorithms before going any further, but I suspect making use
of this hint interface will allow us to prevent interleaving of short
and long term data and so avoid the freespace fragmentation it is
causing here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-19  7:51           ` Dave Chinner
@ 2018-10-21  8:55             ` Avi Kivity
  2018-10-21 14:28               ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-21  8:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 19/10/2018 10.51, Dave Chinner wrote:
> On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
>> On 18/10/2018 14.00, Avi Kivity wrote:
>>>> Can I get access to the metadump to dig around in the filesystem
>>>> directly so I can see how everything has ended up laid out? that
>>>> will help me work out what is actually occurring and determine if
>>>> mkfs/mount options can address the problem or whether deeper
>>>> allocator algorithm changes may be necessary....
>>> I will ask permission to share the dump.
>> I'll send you a link privately.
> Thanks - I've started looking at this - the information here is
> just layout stuff - I've omitted filenames and anything else that
> might be identifying from the output.
>
> Looking at a commit log file:
>
> stat.size = 33554432
> stat.blocks = 34720
> fsxattr.xflags = 0x800 [----------e-----]
> fsxattr.projid = 0
> fsxattr.extsize = 33554432
> fsxattr.cowextsize = 0
> fsxattr.nextents = 14
>
>
> and the layout:
>
> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
>    0: [0..4079]:       2646677520..2646681599 22 (95606800..95610879)  4080 001010
>    1: [4080..8159]:    2643130384..2643134463 22 (92059664..92063743)  4080 001010
>    2: [8160..12239]:   2642124816..2642128895 22 (91054096..91058175)  4080 001010
>    3: [12240..16319]:  2640666640..2640670719 22 (89595920..89599999)  4080 001010
>    4: [16320..18367]:  2640523264..2640525311 22 (89452544..89454591)  2048 000000
>    5: [18368..20415]:  2640119808..2640121855 22 (89049088..89051135)  2048 000000
>    6: [20416..21287]:  2639874064..2639874935 22 (88803344..88804215)   872 001111
>    7: [21288..21295]:  2639874936..2639874943 22 (88804216..88804223)     8 011111
>    8: [21296..24495]:  2639874944..2639878143 22 (88804224..88807423)  3200 001010
>    9: [24496..26543]:  2639427584..2639429631 22 (88356864..88358911)  2048 000000
>   10: [26544..28591]:  2638981120..2638983167 22 (87910400..87912447)  2048 000000
>   11: [28592..30639]:  2638770176..2638772223 22 (87699456..87701503)  2048 000000
>   12: [30640..31279]:  2638247952..2638248591 22 (87177232..87177871)   640 001111
>   13: [31280..34719]:  2638248592..2638252031 22 (87177872..87181311)  3440 011010
>   14: [34720..65535]:  hole                                           30816
>
> The first thing I note is the initial allocations are just short of
> 2MB and so the extent size hint is, indeed, being truncated here
> according to contiguous free space limitations. I had thought that
> should occur from reading the code, but it's complex and I wasn't
> 100% certain what minimum allocation length would be used.
>
> Looking at the system batchlog files, I'm guessing the filesystem
> ran out of contiguous 32MB free space extents some time around
> September 25. The *Data.db files from 24 Sep and earlier are all
> nice 32MB extents; from 25 Sep onwards they never make the full
> 32MB (30-31MB max). eg, good:
>
>   EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
>     0: [0..65535]:       350524552..350590087  3 (2651272..2716807)   65536 001111
>     1: [65536..131071]:  353378024..353443559  3 (5504744..5570279)   65536 001111
>     2: [131072..196607]: 355147016..355212551  3 (7273736..7339271)   65536 001111
>     3: [196608..262143]: 360029416..360094951  3 (12156136..12221671) 65536 001111
>     4: [262144..327679]: 362244144..362309679  3 (14370864..14436399) 65536 001111
>     5: [327680..343415]: 365809456..365825191  3 (17936176..17951911) 15736 001111
>
> bad:
>
> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
>    0: [0..64127]:       512855496..512919623  4 (49024456..49088583) 64128 001111
>    1: [64128..128247]:  266567048..266631167  2 (34651528..34715647) 64120 001010
>    2: [128248..142327]: 264401888..264415967  2 (32486368..32500447) 14080 001111
>   


So extent size is a hint but the extent alignment is a hard requirement. 
Since eventually the ENOSPC happened due to the alignment restriction, I 
think the alignment requirement should be made a hint too.


> Hmmm - there's 2 million files in this filesystem. That is quite a
> lot...
>
> Ok... I see where all the files are - there's a db that was
> snapshotted every half hour going back to December 19 2017. There's
> 55GB of snapshot data there: 14362 snapshots holding 1.8 million
> files.
>
> Ok, now I understand how the filesystem got into this mess. It has
> nothing really to do with the filesystem allocator, geometry, extent
> size hints, etc. It isn't really even an XFS specific problem - I
> think most filesystems would be in trouble if you did this to them.


Well, if you create snapshots and never delete them you'd run into a 
real ENOSPC sooner or later, so the main problem was lack of snapshot 
hygiene. But it did trigger a premature ENOSPC due to the alignment
restriction on those small files with hints (which I'm going to remove).


>
> First, let me demonstrate that the freespace fragmentation is caused
> by these snapshots by removing them all:
>
> before:
>     from      to extents  blocks    pct
>        1       1    5916    5916   0.00
>        2       3   10235   22678   0.01
>        4       7   12251   66829   0.02
>        8      15    5521   59556   0.01
>       16      31    5703  132031   0.03
>       32      63    9754  463825   0.11
>       64     127   16742 1590339   0.37
>      128     255 1550511 390108625  89.87
>      256     511   71516 29178504   6.72
>      512    1023      19   15355   0.00
>     1024    2047     287  461824   0.11
>     2048    4095     528 1611413   0.37
>     4096    8191    1537 10352304   2.38
>     8192   16383       2   19015   0.00
>
> Run a delete:
>
> for d in snapshots/*; do
> 	rm -rf $d &
> done
>
> <cranking along at ~12,000 write iops>
>
> # uptime
> 17:41:08 up 22:07,  1 user,  load average: 14293.17, 13840.37, 9517.14
> #
>
> 500,000 files removed:
>     from      to extents  blocks    pct
>       64     127   22564 2054234   0.47
>      128     255  900480 226428059  51.43
>      256     511  189904 91033237  20.68
>      512    1023   68304 54958788  12.48
>     1024    2047   25187 38284024   8.70
>     2048    4095    5508 15204528   3.45
>     4096    8191    1665 10999789   2.50
>     8192   16383      15  139424   0.03
>
> 1m files removed:
>    from      to extents  blocks    pct
>       64     127   21940 1991685   0.45
>      128     255  536985 134731402  30.35
>      256     511  152092 73465972  16.55
>      512    1023  100471 82971130  18.69
>     1024    2047   48519 74016490  16.67
>     2048    4095   17272 49209538  11.09
>     4096    8191    4307 25135374   5.66
>     8192   16383     135 1254037   0.28
>
> 1.5m files removed:
>    from      to extents  blocks    pct
>       64     127    9851  924782   0.20
>      128     255  227945 57079302  12.32
>      256     511   38723 18129086   3.91
>      512    1023   33547 28027554   6.05
>     1024    2047   31904 50171699  10.83
>     2048    4095   25263 75381887  16.27
>     4096    8191   16885 102836365  22.19
>     8192   16383    6367 68809645  14.85
>    16384   32767    1862 40183775   8.67
>    32768   65535     385 16228869   3.50
>    65536  131071      51 4213237   0.91
>   131072  262143       6  958528   0.21
>
> after:
>    from      to extents  blocks    pct
>      128     255  154063 38785829   8.64
>      256     511   11037 4942114   1.10
>      512    1023    8576 6930035   1.54
>     1024    2047    8496 13464298   3.00
>     2048    4095    7664 23034455   5.13
>     4096    8191    8497 55217061  12.31
>     8192   16383    4233 45867691  10.22
>    16384   32767    1533 33488995   7.46
>    32768   65535     520 23924895   5.33
>    65536  131071     305 28675646   6.39
>   131072  262143     230 42411732   9.45
>   262144  524287      98 37213190   8.29
>   524288 1048575      41 29163579   6.50
> 1048576 2097151      27 40502889   9.03
> 2097152 4194303       5 14576157   3.25
> 4194304 8388607       2 10005670   2.23
>
> >Ok, so the result is not perfect, but there are now huge contiguous
> free space extents available again - ~70% of the free space is now
> contiguous extents >=32MB in length. There's every chance that the
> >fs would continue to help reform large contiguous free spaces as the
> database files come and go now, as long as the snapshot problem is
> dealt with.
>
> So, what's the problem? Well, it's simply that the workload is
> mixing data with vastly different temporal characteristics in the
> same physical locality. Every half an hour, a set of ~100 smallish
> files are written into a new directory which lands them at the low
> >end of the largest free space extent in that AG. Each new snapshot
> directory ends up in a different AG, so it slowly spreads the
> snapshots across all the AGs in the filesystem.


Not exactly - those snapshots are hard links into the live database 
files, which eventually get removed. Usually, small files get removed 
early, but with the snapshots they get to live forever.


> >Each snapshot effectively appends to the current working area in the
> AG, chopping it out of the largest contiguous free space. By the
> time the next snapshot in that AG comes around, there's other new
> short term data between the old snapshot and the new one. The new
> snapshot chops up the largest freespace, and on goes the cycle.
>
> Eventually the short term data between the snapshots gets removed,
> but this doesn't reform large contiguous free spaces because the
> snapshot data is in the way. And so this cycle continues with the
> snapshot data chopping up the largest freespace extents in the
> >filesystem until there are no more large free space extents to be
> found.
>
> The solution is to manage the snapshot data better. We need to keep
> all the long term data physically isolated from the short term data
> so they don't fragment free space. A short term application level
> solution would require migrating the snapshot data out of the
> >filesystem to somewhere else and pointing to it with symlinks.


Snapshots should not live forever on the disk. The procedure is to 
create a snapshot, copy it away, and then delete the snapshot. It's okay 
to let snapshots live for a while, but not all of them and not without a 
bound on their lifetime.


The filesystem did have a role in this, by requiring alignment of the
extent to the RAID stripe size. Now, given that this was a RAID with one
member, alignment is pointless, but most of our deployments are to RAID
arrays with >1 members, and alignment does save 12.5% of IOPS compared
to un-aligned extents for compactions and writes (our scans/writes use
128k buffers, and the alignment is to 1MB; in an unaligned extent, one
in eight 128k I/Os can straddle a 1MB chunk boundary and be split in
two, turning 8 device ops into 9). The database caused the problem by
indirectly requiring 1MB alignment for files that are much smaller than
1MB, and the user contributed to the problem by causing millions of
such small files to be kept.


>
>  From the filesystem POV, I'm not sure that there is much we can do
> about this directly - we have no idea what the lifetime of the data
> is going to be....
>
> <ding>
>
> Hold on....
>
> <rummage in code>
>
> >....we already have an interface for setting those sorts of hints.
>
> fcntl(F_SET_RW_HINT, rw_hint)
>
> /*
>   * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
>   * used to clear any hints previously set.
>   */
> #define RWF_WRITE_LIFE_NOT_SET  0
> #define RWH_WRITE_LIFE_NONE     1
> #define RWH_WRITE_LIFE_SHORT    2
> #define RWH_WRITE_LIFE_MEDIUM   3
> #define RWH_WRITE_LIFE_LONG     4
> #define RWH_WRITE_LIFE_EXTREME  5
>
> Avi, does this sound like something that you could use to
> >classify the different types of data the database writes out?


So long as the penalty for a mis-classification is not too large, we
can for sure. Commitlog files have a short lifespan, and so do newly
born small data files. Those small data files are compacted into
increasingly large, long-lived files, and this information is known at
the time of creation.


Even without the filesystem altering its allocation according to the 
hint, this is still useful, since the disk will alter its internal 
allocation and maybe do something useful with it (as long as the 
filesystem passes the hint to the disk).


>
> I'll need to have a think about how to apply this to the allocator
> policy algorithms before going any further, but I suspect making use
> >of this hint interface will allow us to prevent interleaving of short
> >and long term data and so avoid the freespace fragmentation it is
> causing here....


IIUC, the problem (of having ENOSPC on a 10% used disk) is not 
fragmentation per se, it's the alignment requirement. To take it to the
extreme, a 1TB disk can only hold a million files if those files must be
aligned to 1MB, even if everything is perfectly laid out. For sure 
fragmentation would have degraded performance sooner or later, but 
that's not as bad as that ENOSPC.


I'm addressing the ENOSPC by removing the extent allocation hint on
files that are known to be small (and increasing their application
buffer sizes). In fact that will increase fragmentation, as the
filesystem will allocate one extent per buffer rather than one extent
for the entire file. But given that the extent size is treated as a
hint (or so I infer from the fact that we have <32MB extents), I think
the alignment should be treated as a hint too. Perhaps allocation with
a hint should be performed in two passes, first trying to match size
and alignment, and second relaxing both restrictions.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-19  1:24           ` Dave Chinner
@ 2018-10-21  9:00             ` Avi Kivity
  2018-10-21 14:34               ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-21  9:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 19/10/2018 04.24, Dave Chinner wrote:
> On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote:
>> On 18/10/2018 14.00, Avi Kivity wrote:
>>>
>>> This can happen, and indeed I see our default hint is 1MB, so our
>>> small files use a 1MB hint. Looks like we should remove that 1MB
>>> hint since it's reducing allocation flexibility for XFS without a
>>> good return.
>>
>> I convinced myself that this is the root cause, it fits perfectly
>> with your explanation. I still think that XFS should allocate
>> *something* rather than ENOSPC, but I can also understand someone
>> wanting a guarantee.
> Yup, it's a classic catch 22.
>
>>> On the other hand, I worry that because we bypass the page cache,
>>> XFS doesn't get to see the entire file at one time and so it will
>>> get fragmented.
>>
>> That's what happens. I write 1000 4k writes to 400 files, in
>> parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had
>> 1000 extents.
> Yup, you wrote them all in the one directory, didn't you? :)


Yes :(


But if I have more concurrently-written files than AGs, I'd get the same 
behavior with multiple directories, no?


>> So I'll remove the default hint for small files, and replace it with
>> larger buffer sizes so we batch more and don't get 8k-sized extents
>> (which is our default buffer size).
> Or you could just mount with the "noalign" mount option to turn off
> stripe alignment. After all, you don't need stripe alignment for a
> single spindle....


For a single spindle, sure. But most deployments have multiple spindles.


Since these aren't real spindles, the advantages of alignment are not as 
great, but they still exist. The files are written with aligned offsets, 
and some of the reads are also aligned, so it saves IOPS whenever we 
cross an alignment boundary.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-19  1:15         ` Dave Chinner
@ 2018-10-21  9:21           ` Avi Kivity
  2018-10-21 15:06             ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-21  9:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 19/10/2018 04.15, Dave Chinner wrote:
> On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote:
>> On 18/10/2018 13.05, Dave Chinner wrote:
>>> On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
>>>> On 18/10/2018 04.37, Dave Chinner wrote:
>>>>> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>>>>>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>>>>>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>>>>>> inode64 and has a relatively small number of large files. The disk
>>>>>> is a single-member RAID0 array, with 1MB chunk size. There are 32
>>> Ok, now I need to know what "single member RAID0 array" means,
>>> becuase this is clearly related to allocation alignment and I need
>>> to know why the FS was configured the way it was.
>>
>> It's a Linux RAID device, /dev/md0.
>>
>>
>> We configure it this way so that it's easy to add storage (okay, the
>> real reason is probably to avoid special casing one drive).
> As a stripe? That requires resilvering to expand, which is a slow,
> messy operation. There have also been too many horror stories about
> crashes during resilvering causing unrecoverable corruptions for my
> liking...


Like I said, the real reason is to avoid a special case for one disk. I 
don't think we, or one of our users, ever expanded a RAID array in this way.


>
>> One disk, organized into a Linux RAID device with just one member.
> So there's no real need for IO alignment at all. Unaligned writes
> to RAID0 don't require RMW cycles, so alignment is really only used
> to avoid hotspotting a disk in the stripe. Which isn't an issue
> here, either.


It does help (for >1 member arrays) by avoiding a logically aligned read
or write being split into two ops targeting two disks.


>>>> meta-data=/dev/loop2		isize=512 agcount=32, agsize=14494720 blks
>>>>           =                    sectsz=512 attr=2, projid32bit=1
>>>>           =                    crc=1 finobt=0 spinodes=0 rmapbt=0
>>>>           =                    reflink=0
>>>> data     =                    bsize=4096 blocks=463831040, imaxpct=5
>>>>           =                    sunit=256 swidth=256 blks
>>> sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
>>> and the array only reports one number to mkfs. Was this chosen by
>>> mkfs, or specifically configured by the user? If specifically
>>> configured, why?
>>
>> I'm guessing it's because it has one member? I'm guessing the usual
>> is swidth=sunit*nmembers?
> *nod*. Which is unusual for a RAID0 device.
>
>>> What is important is that it means aligned allocations will be used
>>> for any allocation that is over sunit (1MB) and that's where all the
>>> problems seem to come from.
>> Do these aligned allocations not fall back to non-aligned
>> allocations if they fail?
> They do, but extent size hints change the fallback behaviour...
>
>>> See how we lost a large aligned 2MB freespace @ 9 when the small
>>> file "nn" was laid down? repeat this fill and free pattern over and
>>> over again, and eventually it fragments the free space until there's
>>> no large contiguous free spaces left, and large aligned extents can
>>> no longer be allocated.
>>>
>>> For this to trigger you need the small files to be larger than 1
>>> stripe unit, but still much smaller than the extent size hint, and
>>> the small files need to hang around as the large files come and go.
>>
>> This can happen, and indeed I see our default hint is 1MB, so our
>> small files use a 1MB hint.
> Ok, which forces all allocations to be at least stripe unit (1MB)
> aligned.


If the hint were smaller than the stripe unit, would it remove the 
alignment requirement? I see you answered below.




>> Looks like we should remove that 1MB
>> hint since it's reducing allocation flexibility for XFS without a
>> good return. On the other hand, I worry that because we bypass the
>> page cache, XFS doesn't get to see the entire file at one time and
>> so it will get fragmented.
> Yes. Your other option is to use an extent size hint that is smaller
> than the sunit. That should not align to 1MB because the initial
> data allocation size is not large enough to trigger stripe
> alignment.


Wow, so we had so many factors leading to this:

- 1-disk installations arranged as RAID0 even though not strictly needed

- having a default extent allocation hint, even for small files

- having that default hint be >= the stripe unit size

- the user not removing snapshots

- XFS not falling back to unaligned allocations


>> Suppose I write a 4k file with a 1MB hint. How is that trailing
>> (1MB-4k) marked? Free extent, free extent with extra annotation, or
>> allocated extent? We may need to deallocate those extents? (will
>> FALLOC_FL_PUNCH_HOLE do the trick?)
> It's an unwritten extent beyond EOF, and how that is treated when
> the file is last closed depends on how that extent was allocated.
> But, yes, punching the range beyond EOF will definitely free it.


I think we can conclude from the dump that the filesystem freed it?


>>>>>> Is this a known issue?
>>> The effect and symptom: it's a generic large aligned extent vs small unaligned extent
>>> issue, but I've never seen it manifest in a user workload outside of
>>> a very constrained multistream realtime video ingest/playout
>>> workload (i.e. the workload the filestreams allocator was written
>>> for). And before you ask, no, the filestreams allocator does not
>>> solve this problem.
>>>
>>> The most common manifestation of this problem has been inode
>>> allocation on filesystems full of small files - inodes are allocated
>>> in large aligned extents compared to small files, and so eventually
>>> the filesystem runs out of large contiguous freespace and inodes
>>> can't be allocated. The sparse inodes mkfs option fixed this by
>>> allowing inodes to be allocated as sparse chunks so they could
>>> interleave into any free space available....
>> Shouldn't XFS fall back to a non-aligned allocation rather that
>> returning ENOSPC on a filesystem with 90% free space?
> The filesystem does fall back to unaligned allocation - there's ~5
> separate, progressively less strict allocation attempts on failure.
>
> The problem is that the extent size hint is asking to allocate a
> contiguous 32MB extent and there's no contiguous 32MB free space
> extent available, aligned or not.  That's what I think is generating
> the ENOSPC error, but it's not clear to me from the code whether it
> is supposed to ignore the extent size hint on failure and allocate a
> set of shorter unaligned extents or not....


Here's a file from the dump:


  ext:     logical_offset:        physical_offset: length: expected: flags:
    0:        0..    1eb2:    3928e00..   392acb2:   1eb3:
    1:     1eb3..    3cb2:    3c91200..   3c92fff:   1e00: 392acb3:
    2:     3cb3..    57b2:    3454100..   3455bff:   1b00: 3c93000:
    3:     57b3..    6fb2:    34ecd00..   34ee4ff:   1800: 3455c00:
    4:     6fb3..    85fe:    3386a00..   338804b:   164c: 34ee500:
    5:     85ff..    9c0b:    2c85c00..   2c8720c:   160d: 338804c:
    6:     9c0c..    b217:    3099900..   309af0b:   160c: 2c8720d:
    7:     b218..    c823:    34fb300..   34fc90b:   160c: 309af0c:
    8:     c824..    de2b:    315ef00..   3160507:   1608: 34fc90c:
    9:     de2c..    f42f:    36adc00..   36af203:   1604: 3160508:
   10:     f430..   10a30:    2cf4400..   2cf5a00:   1601: 36af204:
   11:    10a31..   12030:    2e03300..   2e048ff:   1600: 2cf5a01:
   12:    12031..   13630:    2ff5200..   2ff67ff:   1600: 2e04900:
   13:    13631..   14c30:    3199e00..   319b3ff:   1600: 2ff6800:
   14:    14c31..   16230:    32ed500..   32eeaff:   1600: 319b400:
   15:    16231..   17830:    34a0b00..   34a20ff:   1600: 32eeb00:
   16:    17831..   18e30:    354e700..   354fcff:   1600: 34a2100:
   17:    18e31..   1a430:    362c400..   362d9ff:   1600: 354fd00:
   18:    1a431..   1ba1d:    3192b00..   31940ec:   15ed: 362da00:
   19:    1ba1e..   1d05c:    4228500..   4229b3e:   163f: 31940ed:
   20:    1d05d..   1e692:    3f6c900..   3f6df35:   1636: 4229b3f:
   21:    1e693..   1fcc0:    37d4400..   37d5a2d:   162e: 3f6df36:
   22:    1fcc1..   212e4:    43f9c00..   43fb223:   1624: 37d5a2e:
   23:    212e5..   22905:    4003500..   4004b20:   1621: 43fb224:
   24:    22906..   23803:    1fdb900..   1fdc7fd:    efe: 4004b21: last,eof


So, lengths are not always aligned, but physical_offset always is. So 
XFS relaxes the extent size hint but not alignment.


It looks like XFS allocates one extent and moves on, not trying to 
allocate all the way to the 32MB hint size. If that were the case, we'd 
see logical_offset restore alignment every 32MB.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-18 15:54 ` Eric Sandeen
@ 2018-10-21 11:49   ` Avi Kivity
  0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-21 11:49 UTC (permalink / raw)
  To: Eric Sandeen, linux-xfs


On 18/10/2018 18.54, Eric Sandeen wrote:
> On 10/17/18 2:52 AM, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df), getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a relatively small number of large files. The disk is a single-member RAID0 array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.
>>
>>
>> The write load consists of AIO/DIO writes, followed by unlinks of these files. The writes are non-size-changing (we truncate ahead) and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The errors happen on commit logs, which have a target size of 32MB (but may exceed it a little).
>>
>>
>> The errors are sporadic and after restarting the workload they go away for a few hours to a few days, but then return. During one of the crashes I used xfs_db to look at fragmentation and saw that most AGs had free extents of size categories up to 128-255, but a few had more. I tried xfs_fsr but it did not help.
>>
>>
>> Is this a known issue? Would upgrading the kernel help?
>>
>>
>> I'll try to get a metadata dump next time this happens, and I'll be happy to supply more information.
> It sounds like you all figured this out, but I'll drop a reference to
> One Weird Trick to figure out just what function is returning a specific
> error value (the example below is EINVAL)
>
> First is my hack; what follows is Dave's refinement.  We should get this
> into scripts/ some day.


Cool, although to get noticed these days you have to put in bpf 
somewhere (and probably it can help with some kernel-side filtering - 
start logging as soon as you see the error, and hopefully you can 
recover the path from the returns).


>> # for FUNCTION in `grep "t xfs_" /proc/kallsyms | awk '{print $3}'`; do echo "r:ret_$FUNCTION $FUNCTION \$retval" >> /sys/kernel/debug/tracing/kprobe_events; done
>>
>> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 1 > $ENABLE; done
>>
>> run a test that fails:
>>
>> # dd if=/dev/zero of=newfile bs=513 oflag=direct
>> dd: writing `newfile': Invalid argument
>>
>> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 0 > $ENABLE; done
>>
>> # cat /sys/kernel/debug/tracing/trace
>> <snip>
>>             <...>-63791 [000] d... 705435.568913: ret_xfs_vn_mknod: (xfs_vn_create+0x13/0x20 [xfs] <- xfs_vn_mknod) arg1=0
>>             <...>-63791 [000] d... 705435.568913: ret_xfs_vn_create: (vfs_create+0xdb/0x100 <- xfs_vn_create) arg1=0
>>             <...>-63791 [000] d... 705435.568918: ret_xfs_file_open: (do_dentry_open+0x24e/0x2e0 <- xfs_file_open) arg1=0
>>             <...>-63791 [000] d... 705435.568934: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x147/0x150 [xfs] <- xfs_file_dio_aio_write) arg1=ffffffffffffffea
>>
>> Hey look, it's "-22" in hex!
>>
>> so it's possible, but bleah.
> Dave later refined that to:
>
>> #!/bin/bash
>>
>> TRACEDIR=/sys/kernel/debug/tracing
>>
>> grep -i 't xfs_' /proc/kallsyms | awk '{print $3}' | while read F; do
>> 	echo "r:ret_$F $F \$retval" >> $TRACEDIR/kprobe_events
>> done
>>
>> for E in $TRACEDIR/events/kprobes/ret_xfs_*/enable; do
>> 	echo 1 > $E
>> done;
>>
>> echo 'arg1 > 0xffffffffffffff00' > $TRACEDIR/events/kprobes/filter
>>
>> for T in $TRACEDIR/events/kprobes/ret_xfs_*/trigger; do
>> 	echo 'traceoff if arg1 > 0xffffffffffffff00' > $T
>> done
>
>
>> And that gives:
>>
>> # dd if=/dev/zero of=/mnt/scratch/newfile bs=513 oflag=direct
>> dd: error writing '/mnt/scratch/newfile': Invalid argument
>> 1+0 records in
>> 0+0 records out
>> 0 bytes (0 B) copied, 0.000259882 s, 0.0 kB/s
>> root@test4:~# cat /sys/kernel/debug/tracing/trace
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 1/1   #P:16
>> #
>> #                              _-----=> irqs-off
>> #                             / _----=> need-resched
>> #                            | / _---=> hardirq/softirq
>> #                            || / _--=> preempt-depth
>> #                            ||| /     delay
>> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
>> #              | |       |   ||||       |         |
>>             <...>-8073  [006] d... 145740.460546: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x170/0x180 <- xfs_file_dio_aio_write) arg1=0xffffffffffffffea
>>
>> Which is precisely the detection that XFS_ERROR would have given us.
>> Ok, so I guess we can now add whatever we need to that trigger...
>>
>> Basically, pass in the XFS function names you want to trace, that
>> sets up the events and whatever trigger behaviour you want, and
>> we're off to the races...
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-21  8:55             ` Avi Kivity
@ 2018-10-21 14:28               ` Dave Chinner
  2018-10-22  8:35                 ` Avi Kivity
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-21 14:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote:
> 
> On 19/10/2018 10.51, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
> >>On 18/10/2018 14.00, Avi Kivity wrote:
> >>>>Can I get access to the metadump to dig around in the filesystem
> >>>>directly so I can see how everything has ended up laid out? that
> >>>>will help me work out what is actually occurring and determine if
> >>>>mkfs/mount options can address the problem or whether deeper
> >>>>allocator algorithm changes may be necessary....
> >>>I will ask permission to share the dump.
> >>I'll send you a link privately.
> >Thanks - I've started looking at this - the information here is
> >just layout stuff - I've omitted filenames and anything else that
> >might be identifying from the output.
> >
> >Looking at a commit log file:
> >
> >stat.size = 33554432
> >stat.blocks = 34720
> >fsxattr.xflags = 0x800 [----------e-----]
> >fsxattr.projid = 0
> >fsxattr.extsize = 33554432
> >fsxattr.cowextsize = 0
> >fsxattr.nextents = 14
> >
> >
> >and the layout:
> >
> >EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
> >   0: [0..4079]:       2646677520..2646681599 22 (95606800..95610879)  4080 001010
> >   1: [4080..8159]:    2643130384..2643134463 22 (92059664..92063743)  4080 001010
> >   2: [8160..12239]:   2642124816..2642128895 22 (91054096..91058175)  4080 001010
> >   3: [12240..16319]:  2640666640..2640670719 22 (89595920..89599999)  4080 001010
> >   4: [16320..18367]:  2640523264..2640525311 22 (89452544..89454591)  2048 000000
> >   5: [18368..20415]:  2640119808..2640121855 22 (89049088..89051135)  2048 000000
> >   6: [20416..21287]:  2639874064..2639874935 22 (88803344..88804215)   872 001111
> >   7: [21288..21295]:  2639874936..2639874943 22 (88804216..88804223)     8 011111
> >   8: [21296..24495]:  2639874944..2639878143 22 (88804224..88807423)  3200 001010
> >   9: [24496..26543]:  2639427584..2639429631 22 (88356864..88358911)  2048 000000
> >  10: [26544..28591]:  2638981120..2638983167 22 (87910400..87912447)  2048 000000
> >  11: [28592..30639]:  2638770176..2638772223 22 (87699456..87701503)  2048 000000
> >  12: [30640..31279]:  2638247952..2638248591 22 (87177232..87177871)   640 001111
> >  13: [31280..34719]:  2638248592..2638252031 22 (87177872..87181311)  3440 011010
> >  14: [34720..65535]:  hole                                           30816
> >
> >The first thing I note is the initial allocations are just short of
> >2MB and so the extent size hint is, indeed, being truncated here
> >according to contiguous free space limitations. I had thought that
> >should occur from reading the code, but it's complex and I wasn't
> >100% certain what minimum allocation length would be used.
> >
> >Looking at the system batchlog files, I'm guessing the filesystem
> >ran out of contiguous 32MB free space extents some time around
> >September 25. The *Data.db files from 24 Sep and earlier are all
> >nice 32MB extents; from 25 Sep onwards they never make the full
> >32MB (30-31MB max). eg, good:
> >
> >  EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
> >    0: [0..65535]:       350524552..350590087  3 (2651272..2716807)   65536 001111
> >    1: [65536..131071]:  353378024..353443559  3 (5504744..5570279)   65536 001111
> >    2: [131072..196607]: 355147016..355212551  3 (7273736..7339271)   65536 001111
> >    3: [196608..262143]: 360029416..360094951  3 (12156136..12221671) 65536 001111
> >    4: [262144..327679]: 362244144..362309679  3 (14370864..14436399) 65536 001111
> >    5: [327680..343415]: 365809456..365825191  3 (17936176..17951911) 15736 001111
> >
> >bad:
> >
> >EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
> >   0: [0..64127]:       512855496..512919623  4 (49024456..49088583) 64128 001111
> >   1: [64128..128247]:  266567048..266631167  2 (34651528..34715647) 64120 001010
> >   2: [128248..142327]: 264401888..264415967  2 (32486368..32500447) 14080 001111
> 
> 
> So extent size is a hint but the extent alignment is a hard
> requirement.

No, physical alignment is being ignored here, too. Those flags on
the end?

 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end   on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end   on stripe width

When you have 001111, the allocation was completely unaligned.
When you have 001010, the tail is stripe aligned
When you have 000000, the head and tail are stripe aligned

As you can see, there is a mix of aligned, tail aligned and
completely unaligned extents.

So, no, XFS is dropping both size hints and alignment hints when
it starts running out of aligned contiguous free space extents.

> >Ok, so the result is not perfect, but there are now huge contiguous
> >free space extents available again - ~70% of the free space is now
> >contiguous extents >=32MB in length. There's every chance that the
> >fs would continue to help reform large contiguous free spaces as the
> >database files come and go now, as long as the snapshot problem is
> >dealt with.
> >
> >So, what's the problem? Well, it's simply that the workload is
> >mixing data with vastly different temporal characteristics in the
> >same physical locality. Every half an hour, a set of ~100 smallish
> >files are written into a new directory which lands them at the low
> >end of the largest free space extent in that AG. Each new snapshot
> >directory ends up in a different AG, so it slowly spreads the
> >snapshots across all the AGs in the filesystem.
> 
> 
> Not exactly - those snapshots are hard links into the live database
> files, which eventually get removed. Usually, small files get
> removed early, but with the snapshots they get to live forever.

They might be created as hard links, but the effect when the
original database file links are removed is the same - the snapshotted
data lives forever, interleaved amongst short term data.

> >Each snapshot effectively appends to the current working area in the
> >AG, chopping it out of the largest contiguous free space. By the
> >time the next snapshot in that AG comes around, there's other new
> >short term data between the old snapshot and the new one. The new
> >snapshot chops up the largest freespace, and on goes the cycle.
> >
> >Eventually the short term data between the snapshots gets removed,
> >but this doesn't reform large contiguous free spaces because the
> >snapshot data is in the way. And so this cycle continues with the
> >snapshot data chopping up the largest freespace extents in the
> >filesystem until there are no more large free space extents to be
> >found.
> >
> >The solution is to manage the snapshot data better. We need to keep
> >all the long term data physically isolated from the short term data
> >so they don't fragment free space. A short term application level
> >solution would require migrating the snapshot data out of the
> >filesystem to somewhere else and pointing to it with symlinks.
> 
> 
> Snapshots should not live forever on the disk. The procedure is to
> create a snapshot, copy it away, and then delete the snapshot. It's
> okay to let snapshots live for a while, but not all of them and not
> without a bound on their lifetime.
> 
> 
> The filesystem did have a role in this, by requiring alignment of
> the extent to the RAID stripe size.

No, in the end it didn't.

> Now, given that this was a RAID
> with one member, alignment is pointless, but most of our deployments
> are to RAID arrays with >1 members, and alignment does save 12.5% of
> IOPS compared to un-aligned extents for compactions and writes (our
> scans/writes use 128k buffers, and the alignment is to 1MB). The
> database caused the problem by indirectly requiring 1MB alignment
> for files that are much smaller than 1MB, and the user contributed
> to the problem by causing millions of such small files to be kept.

*nod*

> >
> ><ding>
> >
> >Hold on....
> >
> ><rummage in code>
> >
> >....we already have an interface for setting those sorts of hints.
> >
> >fcntl(F_SET_RW_HINT, rw_hint)
> >
> >/*
> >  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
> >  * used to clear any hints previously set.
> >  */
> >#define RWF_WRITE_LIFE_NOT_SET  0
> >#define RWH_WRITE_LIFE_NONE     1
> >#define RWH_WRITE_LIFE_SHORT    2
> >#define RWH_WRITE_LIFE_MEDIUM   3
> >#define RWH_WRITE_LIFE_LONG     4
> >#define RWH_WRITE_LIFE_EXTREME  5
> >
> >Avi, does this sound like something that you could use to
> >classify the different types of data the data base writes out?
> 
> 
> So long as the penalty for a mis-classification is not too large, we
> can for sure.

OK.
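
For reference, here's a minimal userspace sketch of driving that hint
interface (assuming Linux 4.13+; the constants are defined inline from
the header snippet quoted above in case the libc doesn't expose them,
and the file name is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		1036	/* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2	/* from the header snippet above */
#endif

int main(void)
{
	/* hypothetical commit log file; short-lived data in this workload */
	int fd = open("commitlog.db", O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* the fcntl argument is a pointer to a 64-bit hint value */
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");	/* EINVAL on kernels without it */
	return 0;
}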

> >I'll need to have a think about how to apply this to the allocator
> >policy algorithms before going any further, but I suspect making use
> >of this hint interface will allow us to prevent interleaving of short
> >and long term data and so avoid the freespace fragmentation it is
> >causing here....
> 
> 
> IIUC, the problem (of having ENOSPC on a 10% used disk) is not
> fragmentation per se, it's the alignment requirement.

Which, as I've noted above, is a hint, not a requirement.

> To take it to
> extreme, a 1TB disk can only hold a million files if those files
> must be aligned to 1MB, even if everything is perfectly laid out.
> For sure fragmentation would have degraded performance sooner or
> later, but that's not as bad as that ENOSPC.

What it comes down to is that having looked into it, I don't know
why that ENOSPC error occurred.

Alignment didn't cause it because alignment was being dropped - that
just caused free space fragmentation.  Extent size hints didn't
cause it because the size hints were dropped - that just caused
freespace fragmentation. A lack of free space
didn't cause it, because there was heaps of free space in all
allocation groups.

But something tickled a corner case that triggered an allocation
failure that was interpreted as ENOSPC rather than retrying the
allocation.  Until I can reproduce the ENOSPC allocation failure
(and I tried!) then it'll be a mystery as to what caused it.

> entire file. But I think that, given that the extent size is treated
> as a hint (or so I infer from the fact that we have <32MB extents),
> so should the alignment be. Perhaps allocation with a hint should be
> performed in two passes, first trying to match size and alignment,
> and second relaxing both restrictions.

I think I already mentioned there were 5 separate attempts to
allocate, each failure reducing restrictions (sketched in code below):

1. extent sized and contiguous to adjacent block in file
2. extent sized and aligned, at higher block in AG
3. extent sized, not aligned, at higher block in AG
4. >= minimum length, not aligned, anywhere in AG >= target AG 
5. minimum length, not aligned, in any AG
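
An illustrative-only model of that ladder (hypothetical names; the real
code in xfs_bmap_btalloc and friends is far more involved):

#include <errno.h>
#include <stdio.h>

enum ag_scope { TARGET_AG, FORWARD_AGS, ALL_AGS };

struct alloc_args {
	int want_contig;	/* adjacent to the file's last allocated block */
	int want_aligned;	/* stripe aligned */
	int want_extsize;	/* full extent size hint vs. >= minimum length */
	enum ag_scope scope;	/* which AGs may be searched */
};

/* stub standing in for the actual AG search; always fails here */
static int try_alloc(const struct alloc_args *args)
{
	(void)args;
	return -1;
}

static int allocate_with_fallback(void)
{
	struct alloc_args a = { 1, 1, 1, TARGET_AG };

	for (int attempt = 1; attempt <= 5; attempt++) {
		if (try_alloc(&a) == 0)
			return 0;	/* allocated under current constraints */
		switch (attempt) {
		case 1: a.want_contig = 0; break;	/* -> 2: sized + aligned */
		case 2: a.want_aligned = 0; break;	/* -> 3: sized only */
		case 3: a.want_extsize = 0;		/* -> 4: >= min length, */
			a.scope = FORWARD_AGS; break;	/*    AGs >= target */
		case 4: a.scope = ALL_AGS; break;	/* -> 5: min length, any AG */
		}
	}
	return -ENOSPC;		/* only after all five attempts fail */
}

int main(void)
{
	printf("%d\n", allocate_with_fallback());
	return 0;
}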

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-21  9:00             ` Avi Kivity
@ 2018-10-21 14:34               ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2018-10-21 14:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Sun, Oct 21, 2018 at 12:00:16PM +0300, Avi Kivity wrote:
> 
> On 19/10/2018 04.24, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote:
> >>On 18/10/2018 14.00, Avi Kivity wrote:
> >>>
> >>>This can happen, and indeed I see our default hint is 1MB, so our
> >>>small files use a 1MB hint. Looks like we should remove that 1MB
> >>>hint since it's reducing allocation flexibility for XFS without a
> >>>good return.
> >>
>>I convinced myself that this is the root cause; it fits perfectly
> >>with your explanation. I still think that XFS should allocate
> >>*something* rather than ENOSPC, but I can also understand someone
> >>wanting a guarantee.
> >Yup, it's a classic catch 22.
> >
> >>>On the other hand, I worry that because we bypass the page cache,
> >>>XFS doesn't get to see the entire file at one time and so it will
> >>>get fragmented.
> >>
>>That's what happens. I wrote 1000 4k writes to 400 files, in
>>parallel, AIO+DIO. I got 400 perfectly-fragmented files, each with
>>1000 extents.
> >Yup, you wrote them all in the one directory, didn't you? :)
> 
> 
> Yes :(
> 
> But if I have more concurrently-written files than AGs, I'd get the
> same behavior with multiple directories, no?

Up to a point. At which point, I'd say you're doing it wrong and
tell you to use extent size hints or buffered IO so the filesystem
can turn the small random writes into nicely formed large IOs via
delayed allocation. :)

Remember the first rule of storage: Garbage In, Garbage Out.

With direct IO, it's the responsibility of the application to give
the filesystem and storage layers well formed IOs. If the app doesn't
play nice, there's nothing the filesystem or storage layers can do
to make it better....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-21  9:21           ` Avi Kivity
@ 2018-10-21 15:06             ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2018-10-21 15:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Sun, Oct 21, 2018 at 12:21:33PM +0300, Avi Kivity wrote:
> 
> On 19/10/2018 04.15, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote:
> >>On 18/10/2018 13.05, Dave Chinner wrote:
> >>>On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
> >>>>On 18/10/2018 04.37, Dave Chinner wrote:

> >>Looks like we should remove that 1MB
> >>hint since it's reducing allocation flexibility for XFS without a
> >>good return. On the other hand, I worry that because we bypass the
> >>page cache, XFS doesn't get to see the entire file at one time and
> >>so it will get fragmented.
> >Yes. Your other option is to use an extent size hint that is smaller
> >than the sunit. That should not align to 1MB because the initial
> >data allocation size is not large enough to trigger stripe
> >alignment.
> 
> 
> Wow, so we had so many factors leading to this:
> 
> - 1-disk installations arranged as RAID0 even though not strictly needed
> 
> - having a default extent allocation hint, even for small files
> 
> - having that default hint be >= the stripe unit size
> 
> - the user not removing snapshots
> 
> - XFS not falling back to unaligned allocations

Everything but the last is true. XFS is definitely dropping the
alignment hint once there are no more aligned contiguous free space
extents.

> >>Suppose I write a 4k file with a 1MB hint. How is that trailing
> >>(1MB-4k) marked? Free extent, free extent with extra annotation, or
> >>allocated extent? We may need to deallocate those extents? (will
> >>FALLOC_FL_PUNCH_HOLE do the trick?)
> >It's an unwritten extent beyond EOF, and how that is treated when
> >the file is last closed depends on how that extent was allocated.
> >But, yes, punching the range beyond EOF will definitely free it.
> 
> I think we can conclude from the dump that the filesystem freed it?

*nod*
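
A minimal sketch of that punch from userspace, assuming the usual
fallocate(2) interface (offsets are made up, and PUNCH_HOLE must be
paired with KEEP_SIZE so the file length doesn't change):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	int fd = open("smallfile.db", O_RDWR);	/* hypothetical 4k file */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* free the speculative tail: from the end of the data out to the
	 * 1MB the extent size hint caused to be allocated */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4096, 1048576 - 4096) < 0)
		perror("fallocate");
	return 0;
}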

>  ext:    logical_offset:      physical_offset: length: expected: flags:
>   0:     0..    1eb2:    3928e00..   392acb2:   1eb3:
>   1:     1eb3..    3cb2:    3c91200..   3c92fff:   1e00: 392acb3:
>   2:     3cb3..    57b2:    3454100..   3455bff:   1b00: 3c93000:
>   3:     57b3..    6fb2:    34ecd00..   34ee4ff:   1800: 3455c00:
>   4:     6fb3..    85fe:    3386a00..   338804b:   164c: 34ee500:
>   5:     85ff..    9c0b:    2c85c00..   2c8720c:   160d: 338804c:
>   6:     9c0c..    b217:    3099900..   309af0b:   160c: 2c8720d:
>   7:     b218..    c823:    34fb300..   34fc90b:   160c: 309af0c:
>   8:     c824..    de2b:    315ef00..   3160507:   1608: 34fc90c:
>   9:     de2c..    f42f:    36adc00..   36af203:   1604: 3160508:
>   10:    f430..    10a30:    2cf4400..   2cf5a00:   1601: 36af204:
>   11:    10a31..   12030:    2e03300..   2e048ff:   1600: 2cf5a01:
>   12:    12031..   13630:    2ff5200..   2ff67ff:   1600: 2e04900:
>   13:    13631..   14c30:    3199e00..   319b3ff:   1600: 2ff6800:
>   14:    14c31..   16230:    32ed500..   32eeaff:   1600: 319b400:
>   15:    16231..   17830:    34a0b00..   34a20ff:   1600: 32eeb00:
>   16:    17831..   18e30:    354e700..   354fcff:   1600: 34a2100:
>   17:    18e31..   1a430:    362c400..   362d9ff:   1600: 354fd00:
>   18:    1a431..   1ba1d:    3192b00..   31940ec:   15ed: 362da00:
>   19:    1ba1e..   1d05c:    4228500..   4229b3e:   163f: 31940ed:
>   20:    1d05d..   1e692:    3f6c900..   3f6df35:   1636: 4229b3f:
>   21:    1e693..   1fcc0:    37d4400..   37d5a2d:   162e: 3f6df36:
>   22:    1fcc1..   212e4:    43f9c00..   43fb223:   1624: 37d5a2e:
>   23:    212e5..   22905:    4003500..   4004b20:   1621: 43fb224:
>   24:    22906..   23803:    1fdb900..   1fdc7fd:    efe: 4004b21: last,eof

filefrag? I find that utterly unreadable, and without the command
line I don't know what the units are.  Can you use 'xfs_bmap -vvp'
so that all the units are known and it automatically calculates
whether extents are aligned or not?

> So, lengths are not always aligned, but physical_offset always is.
> So XFS relaxes the extent size hint but not alignment.

No, that is incorrect. 

Filesystems never do what people expect them to.

i.e. what you see above is because the filesystem could not find
large enough contiguous free spaces to align both the ends of the
allocation:


Freespace looks like:
	+----FF+FFFFFF+FFFFFF+FFFF-+------+

Alloc aligned w/ min len and max len

	+----FF+FFFFFF+FFFFFF+FFFF-+------+
               +WANT-THIS-BIT_HERE-+ 

But the nearest target free space extent returns:

	     fffffffffffffffffffff

So we trim the front
	       fffffffffffffffffff

if len < min len, fail (didn't happen)

if > max len, trim end (no trim, not long enough)

And so we end up allocating front aligned and short:

               +WANT-THIS-BIT_HER+ 

Leaving behind:

	+----FF+------+------+-----+------+

That's why it looks like there are aligned extents remaining, even
when there aren't.

The allocation logic is horrifically complex - it has 20-something
controlling parameters and a heap of logic, maths and fallback paths
around them. Unless you're intimately familiar with the code,
you're unlikely to infer the allocator decisions from an extent
list....
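
Still, a toy model of just the trimming step above may help -- this is
not the real allocator, just the front-align/length-trim arithmetic
with made-up types (units are filesystem blocks):

#include <stdio.h>

struct candidate {
	unsigned long start, len;	/* one free space extent */
};

/* returns 1 if the trimmed candidate is usable, 0 if this attempt fails */
static int trim_candidate(struct candidate *c, unsigned long align,
			  unsigned long minlen, unsigned long maxlen)
{
	unsigned long aligned = (c->start + align - 1) / align * align;
	unsigned long skip = aligned - c->start;

	if (skip >= c->len)
		return 0;		/* aligning the front consumed it all */
	c->start = aligned;		/* front trimmed to stripe alignment */
	c->len -= skip;
	if (c->len < minlen)
		return 0;		/* too short: relax constraints, retry */
	if (c->len > maxlen)
		c->len = maxlen;	/* tail trimmed only when long enough, */
	return 1;			/* otherwise: front aligned and short */
}

int main(void)
{
	struct candidate c = { 1000, 500 };

	if (trim_candidate(&c, 256, 1, 512))	/* 256 blocks = 1MB at 4k */
		printf("allocated [%lu, +%lu)\n", c.start, c.len);
	return 0;
}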

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-21 14:28               ` Dave Chinner
@ 2018-10-22  8:35                 ` Avi Kivity
  2018-10-22  9:52                   ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-22  8:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 21/10/2018 17.28, Dave Chinner wrote:
> On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote:
>> On 19/10/2018 10.51, Dave Chinner wrote:
>>> On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
>>>> On 18/10/2018 14.00, Avi Kivity wrote:
>>>>>> Can I get access to the metadump to dig around in the filesystem
>>>>>> directly so I can see how everything has ended up laid out? That
>>>>>> will help me work out what is actually occurring and determine if
>>>>>> mkfs/mount options can address the problem or whether deeper
>>>>>> allocator algorithm changes may be necessary....
>>>>> I will ask permission to share the dump.
>>>> I'll send you a link privately.
>>> Thanks - I've started looking at this - the information here is
>>> just layout stuff - I've omitted filenames and anything else that
>>> might be identifying from the output.
>>>
>>> Looking at a commit log file:
>>>
>>> stat.size = 33554432
>>> stat.blocks = 34720
>>> fsxattr.xflags = 0x800 [----------e-----]
>>> fsxattr.projid = 0
>>> fsxattr.extsize = 33554432
>>> fsxattr.cowextsize = 0
>>> fsxattr.nextents = 14
>>>
>>>
>>> and the layout:
>>>
>>> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
>>>    0: [0..4079]:       2646677520..2646681599 22 (95606800..95610879)  4080 001010
>>>    1: [4080..8159]:    2643130384..2643134463 22 (92059664..92063743)  4080 001010
>>>    2: [8160..12239]:   2642124816..2642128895 22 (91054096..91058175)  4080 001010
>>>    3: [12240..16319]:  2640666640..2640670719 22 (89595920..89599999)  4080 001010
>>>    4: [16320..18367]:  2640523264..2640525311 22 (89452544..89454591)  2048 000000
>>>    5: [18368..20415]:  2640119808..2640121855 22 (89049088..89051135)  2048 000000
>>>    6: [20416..21287]:  2639874064..2639874935 22 (88803344..88804215)   872 001111
>>>    7: [21288..21295]:  2639874936..2639874943 22 (88804216..88804223)     8 011111
>>>    8: [21296..24495]:  2639874944..2639878143 22 (88804224..88807423)  3200 001010
>>>    9: [24496..26543]:  2639427584..2639429631 22 (88356864..88358911)  2048 000000
>>>   10: [26544..28591]:  2638981120..2638983167 22 (87910400..87912447)  2048 000000
>>>   11: [28592..30639]:  2638770176..2638772223 22 (87699456..87701503)  2048 000000
>>>   12: [30640..31279]:  2638247952..2638248591 22 (87177232..87177871)   640 001111
>>>   13: [31280..34719]:  2638248592..2638252031 22 (87177872..87181311)  3440 011010
>>>   14: [34720..65535]:  hole                                           30816
>>>
>>> The first thing I note is the initial allocations are just short of
>>> 2MB and so the extent size hint is, indeed, being truncated here
>>> according to contiguous free space limitations. I had thought that
>>> should occur from reading the code, but it's complex and I wasn't
>>> 100% certain what minimum allocation length would be used.
>>>
>>> Looking at the system batchlog files, I'm guessing the filesystem
>>> ran out of contiguous 32MB free space extents some time around
>>> September 25. The *Data.db files from 24 Sep and earlier then are
>>> all nice 32MB extents, from 25 sep onwards they never make the full
>>> 32MB (30-31MB max). eg, good:
>>>
>>>   EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
>>>     0: [0..65535]:       350524552..350590087  3 (2651272..2716807)   65536 001111
>>>     1: [65536..131071]:  353378024..353443559  3 (5504744..5570279)   65536 001111
>>>     2: [131072..196607]: 355147016..355212551  3 (7273736..7339271)   65536 001111
>>>     3: [196608..262143]: 360029416..360094951  3 (12156136..12221671) 65536 001111
>>>     4: [262144..327679]: 362244144..362309679  3 (14370864..14436399) 65536 001111
>>>     5: [327680..343415]: 365809456..365825191  3 (17936176..17951911) 15736 001111
>>>
>>> bad:
>>>
>>> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
>>>    0: [0..64127]:       512855496..512919623  4 (49024456..49088583) 64128 001111
>>>    1: [64128..128247]:  266567048..266631167  2 (34651528..34715647) 64120 001010
>>>    2: [128248..142327]: 264401888..264415967  2 (32486368..32500447) 14080 001111
>>
>> So extent size is a hint but the extent alignment is a hard
>> requirement.
> No, physical alignment is being ignored here, too. Those flags on
> the end?
>
>   FLAG Values:
>      0100000 Shared extent
>      0010000 Unwritten preallocated extent
>      0001000 Doesn't begin on stripe unit
>      0000100 Doesn't end   on stripe unit
>      0000010 Doesn't begin on stripe width
>      0000001 Doesn't end   on stripe width
>
> When you have 001111, the allocation was completely unaligned.
> When you have 001010, the tail is stripe aligned
> When you ahve 000000, the head and tail are stripe aligned
>
> As you can see, there is a mix of aligned, tail aligned and
> completely unaligned extents.
>
> So, no, XFS is dropping both size hints and alignment hints when
> it starts running out of aligned contiguous free space extents.


You are right; I searched for and found some files that were
head-aligned, and jumped to conclusions, but there are many that are
not. Those head-aligned files probably belonged to an era in that
filesystem's life when head-aligned extents less than 1MB were available.
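
For anyone decoding those FLAGS columns by hand, here's a tiny helper
matching the bit table quoted above. Since the six digits are only ever
0 or 1, reading the field as a C octal literal happens to work:

#include <stdio.h>

static void decode_flags(unsigned int flags)
{
	if (flags & 0100000) puts("shared extent");
	if (flags & 0010000) puts("unwritten preallocated extent");
	if (flags & 0001000) puts("doesn't begin on stripe unit");
	if (flags & 0000100) puts("doesn't end on stripe unit");
	if (flags & 0000010) puts("doesn't begin on stripe width");
	if (flags & 0000001) puts("doesn't end on stripe width");
}

int main(void)
{
	decode_flags(001111);	/* a completely unaligned extent */
	return 0;
}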


>>> Ok, so the result is not perfect, but there are now huge contiguous
>>> free space extents available again - ~70% of the free space is now
>>> contiguous extents >=32MB in length. There's every chance that the
>>> fs would continue to help reform large contiguous free spaces as the
>>> database files come and go now, as long as the snapshot problem is
>>> dealt with.
>>>
>>> So, what's the problem? Well, it's simply that the workload is
>>> mixing data with vastly different temporal characteristics in the
>>> same physical locality. Every half an hour, a set of ~100 smallish
>>> files are written into a new directory which lands them at the low
>>> end of the largest free space extent in that AG. Each new snapshot
>>> directory ends up in a different AG, so it slowly spreads the
>>> snapshots across all the AGs in the filesystem.
>>
>> Not exactly - those snapshots are hard links into the live database
>> files, which eventually get removed. Usually, small files get
>> removed early, but with the snapshots they get to live forever.
> They might be created as hard links, but the effect when the
> original database file links are removed is the same - the snapshotted
> data lives forever, interleaved amongst short term data.


Yes.


>>> Each snapshot effectively appends to the current working area in the
>>> AG, chopping it out of the largest contiguous free space. By the
>>> time the next snapshot in that AG comes around, there's other new
>>> short term data between the old snapshot and the new one. The new
>>> snapshot chops up the largest freespace, and on goes the cycle.
>>>
>>> Eventually the short term data between the snapshots gets removed,
>>> but this doesn't reform large contiguous free spaces because the
>>> snapshot data is in the way. And so this cycle continues with the
>>> snapshot data chopping up the largest freespace extents in the
>>> filesystem until there are no more large free space extents to be
>>> found.
>>>
>>> The solution is to manage the snapshot data better. We need to keep
>>> all the long term data physically isolated from the short term data
>>> so they don't fragment free space. A short term application level
>>> solution would require migrating the snapshot data out of the
>>> filesystem to somewhere else and pointing to it with symlinks.
>>
>> Snapshots should not live forever on the disk. The procedure is to
>> create a snapshot, copy it away, and then delete the snapshot. It's
>> okay to let snapshots live for a while, but not all of them and not
>> without a bound on their lifetime.
>>
>>
>> The filesystem did have a role in this, by requiring alignment of
>> the extent to the RAID stripe size.
> No, in the end it didn't.


Right.


>
>> Now, given that this was a RAID
>> with one member, alignment is pointless, but most of our deployments
>> are to RAID arrays with >1 members, and alignment does save 12.5% of
>> IOPS compared to un-aligned extents for compactions and writes (our
>> scans/writes use 128k buffers, and the alignment is to 1MB). The
>> database caused the problem by indirectly requiring 1MB alignment
>> for files that are much smaller than 1MB, and the user contributed
>> to the problem by causing millions of such small files to be kept.
> *nod*
>
>>> <ding>
>>>
>>> Hold on....
>>>
>>> <rummage in code>
>>>
>>> ....we already have an interface for setting those sorts of hints.
>>>
>>> fcntl(F_SET_RW_HINT, rw_hint)
>>>
>>> /*
>>>   * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
>>>   * used to clear any hints previously set.
>>>   */
>>> #define RWF_WRITE_LIFE_NOT_SET  0
>>> #define RWH_WRITE_LIFE_NONE     1
>>> #define RWH_WRITE_LIFE_SHORT    2
>>> #define RWH_WRITE_LIFE_MEDIUM   3
>>> #define RWH_WRITE_LIFE_LONG     4
>>> #define RWH_WRITE_LIFE_EXTREME  5
>>>
>>> Avi, does this sound like something that you could use to
>>> classify the different types of data the data base writes out?
>>
>> So long as the penalty for a mis-classification is not too large, we
>> can for sure.
> OK.
>
>>> I'll need to have a think about how to apply this to the allocator
>>> policy algorithms before going any further, but I suspect making use
>>> of this hint interface will allow us to prevent interleaving of short
>>> and long term data and so avoid the freespace fragmentation it is
>>> causing here....
>>
>> IIUC, the problem (of having ENOSPC on a 10% used disk) is not
>> fragmentation per se, it's the alignment requirement.
> Which, as I've noted above, is a hint, not a requirement.
>
>> To take it to
>> extreme, a 1TB disk can only hold a million files if those files
>> must be aligned to 1MB, even if everything is perfectly laid out.
>> For sure fragmentation would have degraded performance sooner or
>> later, but that's not as bad as that ENOSPC.
> What it comes down to is that having looked into it, I don't know
> why that ENOSPC error occurred.
>
> Alignment didn't cause it because alignment was being dropped - that
> just caused free space fragmentation.  Extent size hints didn't
> cause it because the size hints were dropped - that just caused
> freespace fragmentation. A lack of free space
> didn't cause it, because there was heaps of free space in all
> allocation groups.
>
> But something tickled a corner case that triggered an allocation
> failure that was interpreted as ENOSPC rather than retrying the
> allocation.  Until I can reproduce the ENOSPC allocation failure
> (and I tried!) then it'll be a mystery as to what caused it.


The user reported the error happening multiple times, taking many hours 
to reproduce, but on more than one node. So it's an obscure corner case 
but not obscure enough to be a one-off event.


I've asked the user to regularly trim their snapshots (they were not 
aware of the snapshots actually - they were performed as a side effect 
of a TRUNCATE operation), and we'll remove the default extent hint for 
small files. I'll also consider noalign - the 12.5% reduction in IOPS is 
perhaps not worth the fragmentation it generates.
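
For the record, a sketch of what removing the hint looks like at the
ioctl level -- this assumes the generic FS_IOC_*/FS_XFLAG_* spellings
from <linux/fs.h> (Linux 4.5+; older code uses the XFS_IOC_FSGETXATTR/
XFS_IOC_FSSETXATTR names), and on XFS the hint should be changed before
the file has any data:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(void)
{
	int fd = open("smallfile.db", O_RDONLY);	/* hypothetical */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fsxattr fsx;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}
	fsx.fsx_xflags &= ~FS_XFLAG_EXTSIZE;	/* stop forcing a fixed hint */
	fsx.fsx_extsize = 0;			/* 0 = let the fs decide */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		perror("FS_IOC_FSSETXATTR");
	return 0;
}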


>
>> entire file. But I think that, given that the extent size is treated
>> as a hint (or so I infer from the fact that we have <32MB extents),
>> so should the alignment be. Perhaps allocation with a hint should be
>> performed in two passes, first trying to match size and alignment,
>> and second relaxing both restrictions.
> I think I already mentioned there were 5 separate attempts to
> allocate, each failure reducing restrictions:
>
> 1. extent sized and contiguous to adjacent block in file
> 2. extent sized and aligned, at higher block in AG
> 3. extent sized, not aligned, at higher block in AG
> 4. >= minimum length, not aligned, anywhere in AG >= target AG


Surprised at this one. Won't it skew usage in high AGs?


Perhaps it's rare enough not to matter.


Perhaps those higher-block/higher-AG heuristics can be improved for 
non-rotational media.


> 5. minimum length, not aligned, in any AG


Thanks for your patience in helping me understand this issue.


Avi


> Cheers,
>
> Dave.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-22  8:35                 ` Avi Kivity
@ 2018-10-22  9:52                   ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2018-10-22  9:52 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Mon, Oct 22, 2018 at 11:35:26AM +0300, Avi Kivity wrote:
> 
> On 21/10/2018 17.28, Dave Chinner wrote:
> >On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote:
> >>For sure fragmentation would have degraded performance sooner or
> >>later, but that's not as bad as that ENOSPC.
> >What it comes down to is that having looked into it, I don't know
> >why that ENOSPC error occurred.
> >
> >Alignment didn't cause it because alignment was being dropped - that
> >just caused free space fragmentation.  Extent size hints didn't
> >cause it because the size hints were dropped - that just caused
> >freespace fragmentation. A lack of free space
> >didn't cause it, because there was heaps of free space in all
> >allocation groups.
> >
> >But something tickled a corner case that triggered an allocation
> >failure that was interpreted as ENOSPC rather than retrying the
> >allocation.  Until I can reproduce the ENOSPC allocation failure
> >(and I tried!) then it'll be a mystery as to what caused it.
> 
> 
> The user reported the error happening multiple times, taking many
> hours to reproduce, but on more than one node. So it's an obscure
> corner case but not obscure enough to be a one-off event.

Yeah, as with all these sorts of things, the difficulty is in
reproducing it. I'll have a look through some of the higher level
code during the week to see if there's a min/max len condition I
missed somewhere that might lead to failure instead of a retry.
It shouldn't really fail at all, because in the end a single
block allocation is allowable for normal extent size w/ alignment
allocation and there is heaps of free space available.

> >>entire file. But I think that, given that the extent size is treated
> >>as a hint (or so I infer from the fact that we have <32MB extents),
> >>so should the alignment be. Perhaps allocation with a hint should be
> >>performed in two passes, first trying to match size and alignment,
> >>and second relaxing both restrictions.
> >I think I already mentioned there were 5 separate attempts to
> >allocate, each failure reducing restrictions:
> >
> >1. extent sized and contiguous to adjacent block in file
> >2. extent sized and aligned, at higher block in AG
> >3. extent sized, not aligned, at higher block in AG
> >4. >= minimum length, not aligned, anywhere in AG >= target AG
> 
> 
> Surprised at this one. Won't it skew usage in high AGs?

It's a constraint based on AG locking order. We always lock in
ascending AG order, so if we've locked AG 4 and modified the free
list in preparation for allocation, then failed to find an aligned
extent, that AG will remain locked until we finish the allocation
process, and hence we can't lock AGs <= AG 4, otherwise we risk
deadlocking the allocator.....
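
A hypothetical one-function sketch of that ordering rule -- not the
XFS code, just the invariant it enforces:

#include <stdbool.h>

struct alloc_cursor {
	int highest_locked_agno;	/* -1 if nothing locked yet */
	int ag_count;
};

/* Within one allocation, locking only ever moves to higher AG numbers,
 * which is what prevents two allocators holding different AGs from
 * deadlocking against each other. */
static bool may_lock_ag(const struct alloc_cursor *cur, int agno)
{
	return agno > cur->highest_locked_agno && agno < cur->ag_count;
}

int main(void)
{
	struct alloc_cursor cur = { 4, 32 };	/* AG 4 locked, 32 AGs */

	return may_lock_ag(&cur, 2);	/* AG 2: not allowed -> returns 0 */
}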

> Perhaps it's rare enough not to matter.

It tends to be rare because we choose the AG ahead of time to ensure
that the majority of the time there is space available.

> Thanks for your patience in helping me understand this issue.

No worries, that's what I'm here for :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2018-10-17  7:52 ENSOPC on a 10% used disk Avi Kivity
                   ` (2 preceding siblings ...)
  2018-10-18 15:54 ` Eric Sandeen
@ 2019-02-05 21:48 ` Dave Chinner
  2019-02-07 10:51   ` Avi Kivity
  3 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2019-02-05 21:48 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

Hi Avi,

On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown
> by df), getting sporadic ENOSPC errors. The disk is mounted with
> inode64 and has a relatively small number of large files. The disk
> is a single-member RAID0 array, with 1MB chunk size. There are 32
> AGs. Running Linux 4.9.17.
> 
> 
> The write load consists of AIO/DIO writes, followed by unlinks of
> these files. The writes are non-size-changing (we truncate ahead)
> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
> 32MB. The errors happen on commit logs, which have a target size of
> 32MB (but may exceed it a little).
> 
> 
> The errors are sporadic and after restarting the workload they go
> away for a few hours to a few days, but then return. During one of
> the crashes I used xfs_db to look at fragmentation and saw that most
> AGs had free extents of size categories up to 128-255, but a few had
> more. I tried xfs_fsr but it did not help.
> 
> 
> Is this a known issue? Would upgrading the kernel help?

Long time, I know, but Brian has just made me aware of this commit
from early 2018 that went into 4.16 that might be relevant and so I
thought it best to close the loop:

commit 6d8a45ce29c7d67cc4fc3016dc2a07660c62482a
Author: Darrick J. Wong <darrick.wong@oracle.com>
Date:   Fri Jan 19 17:47:36 2018 -0800

    xfs: don't screw up direct writes when freesp is fragmented
    
    xfs_bmap_btalloc is given a range of file offset blocks that must be
    allocated to some data/attr/cow fork.  If the fork has an extent size
    hint associated with it, the request will be enlarged on both ends to
    try to satisfy the alignment hint.  If free space is fragmentated,
    sometimes we can allocate some blocks but not enough to fulfill any of
    the requested range.  Since bmapi_allocate always trims the new extent
    mapping to match the originally requested range, this results in
    bmapi_write returning zero and no mapping.
    
    The consequences of this vary -- buffered writes will simply re-call
    bmapi_write until it can satisfy at least one block from the original
    request.  Direct IO overwrites notice nmaps == 0 and return -ENOSPC
    through the dio mechanism out to userspace with the weird result that
    writes fail even when we have enough space because the ENOSPC return
    overrides any partial write status.  For direct CoW writes the situation
    was disastrous because nobody notices us returning an invalid zero-length
    wrong-offset mapping to iomap and the write goes off into space.
    
    Therefore, if free space is so fragmented that we managed to allocate
    some space but not enough to map into even a single block of the
    original allocation request range, we should break the alignment hint in
    order to guarantee at least some forward progress for the direct write.
    If we return a short allocation to iomap_apply it'll call back about the
    remaining blocks.
    
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

The spurious ENOSPC symptoms seem to match what you are seeing here
on your customer's 4.9 kernel, so it may be that this is the fix for
the ENOSPC problem that was reported. If this comes up again, then
perhaps it would be worth either upgrading the kernel to 4.16+ or
backporting this commit to see if it fixes the problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: ENSOPC on a 10% used disk
  2019-02-05 21:48 ` Dave Chinner
@ 2019-02-07 10:51   ` Avi Kivity
  0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2019-02-07 10:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On 05/02/2019 23.48, Dave Chinner wrote:
> Hi Avi,
>
> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>> inode64 and has a relatively small number of large files. The disk
>> is a single-member RAID0 array, with 1MB chunk size. There are 32
>> AGs. Running Linux 4.9.17.
>>
>>
>> The write load consists of AIO/DIO writes, followed by unlinks of
>> these files. The writes are non-size-changing (we truncate ahead)
>> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
>> 32MB. The errors happen on commit logs, which have a target size of
>> 32MB (but may exceed it a little).
>>
>>
>> The errors are sporadic and after restarting the workload they go
>> away for a few hours to a few days, but then return. During one of
>> the crashes I used xfs_db to look at fragmentation and saw that most
>> AGs had free extents of size categories up to 128-255, but a few had
>> more. I tried xfs_fsr but it did not help.
>>
>>
>> Is this a known issue? Would upgrading the kernel help?
> Long time, I know, but Brian has just made me aware of this commit
> from early 2018 that went into 4.16 that might be relevant and so I
> thought it best to close the loop:
>
> commit 6d8a45ce29c7d67cc4fc3016dc2a07660c62482a
> Author: Darrick J. Wong <darrick.wong@oracle.com>
> Date:   Fri Jan 19 17:47:36 2018 -0800
>
>      xfs: don't screw up direct writes when freesp is fragmented
>      
>      xfs_bmap_btalloc is given a range of file offset blocks that must be
>      allocated to some data/attr/cow fork.  If the fork has an extent size
>      hint associated with it, the request will be enlarged on both ends to
>      try to satisfy the alignment hint.  If free space is fragmentated,
>      sometimes we can allocate some blocks but not enough to fulfill any of
>      the requested range.  Since bmapi_allocate always trims the new extent
>      mapping to match the originally requested range, this results in
>      bmapi_write returning zero and no mapping.
>      
>      The consequences of this vary -- buffered writes will simply re-call
>      bmapi_write until it can satisfy at least one block from the original
>      request.  Direct IO overwrites notice nmaps == 0 and return -ENOSPC
>      through the dio mechanism out to userspace with the weird result that
>      writes fail even when we have enough space because the ENOSPC return
>      overrides any partial write status.  For direct CoW writes the situation
>      was disastrous because nobody notices us returning an invalid zero-length
>      wrong-offset mapping to iomap and the write goes off into space.
>      
>      Therefore, if free space is so fragmented that we managed to allocate
>      some space but not enough to map into even a single block of the
>      original allocation request range, we should break the alignment hint in
>      order to guarantee at least some forward progress for the direct write.
>      If we return a short allocation to iomap_apply it'll call back about the
>      remaining blocks.
>      
>      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
>      Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> The spurious ENOSPC symptoms seem to match what you are seeing here
> on your customer's 4.9 kernel, so it may be that this is the fix for
> the ENOSPC problem that was reported. If this comes up again, then
> perhaps it would be worth either upgrading the kernel to 4.16+ or
> backporting this commit to see if it fixes the problem.


Thanks for remembering. Indeed it looks like a good match for the 
problem. We did not see the problem again (it took quite a combination 
of screwups to achieve), but I'll remember this in case we do.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-02-07 10:51 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-17  7:52 ENSOPC on a 10% used disk Avi Kivity
2018-10-17  8:47 ` Christoph Hellwig
2018-10-17  8:57   ` Avi Kivity
2018-10-17 10:54     ` Avi Kivity
2018-10-18  1:37 ` Dave Chinner
2018-10-18  7:55   ` Avi Kivity
2018-10-18 10:05     ` Dave Chinner
2018-10-18 11:00       ` Avi Kivity
2018-10-18 13:36         ` Avi Kivity
2018-10-19  7:51           ` Dave Chinner
2018-10-21  8:55             ` Avi Kivity
2018-10-21 14:28               ` Dave Chinner
2018-10-22  8:35                 ` Avi Kivity
2018-10-22  9:52                   ` Dave Chinner
2018-10-18 15:44         ` Avi Kivity
2018-10-18 16:11           ` Avi Kivity
2018-10-19  1:24           ` Dave Chinner
2018-10-21  9:00             ` Avi Kivity
2018-10-21 14:34               ` Dave Chinner
2018-10-19  1:15         ` Dave Chinner
2018-10-21  9:21           ` Avi Kivity
2018-10-21 15:06             ` Dave Chinner
2018-10-18 15:54 ` Eric Sandeen
2018-10-21 11:49   ` Avi Kivity
2019-02-05 21:48 ` Dave Chinner
2019-02-07 10:51   ` Avi Kivity
