* Is it possible to speed up unlink()?
@ 2016-10-20  9:29 Timofey Titovets
  2016-10-20 12:09 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 8+ messages in thread
From: Timofey Titovets @ 2016-10-20  9:29 UTC (permalink / raw)
  To: linux-btrfs

Hi, I use btrfs for NFS VM replica storage and for NFS shared VM storage.
At the moment I have a small problem: deleting a VM image takes too long,
and the NFS client reports a timeout on deletion
(ESXi Storage migration, for example).

Kernel: Linux nfs05 4.7.0-0.bpo.1-amd64 #1 SMP Debian 4.7.5-1~bpo8+2
(2016-10-01) x86_64 GNU/Linux
Mount options: noatime,compress-force=zlib,space_cache,commit=180
Feature enabled:
big_metadata:1
compress_lzo:1
extended_iref:1
mixed_backref:1
no_holes:1
skinny_metadata:1

AFAIK, unlink() returns only once all references to all extents of the
unlinked inode have been deleted.
So with compression enabled, files have many, many refs to each
compressed chunk.
So, is it possible to return from unlink() early? Or is this a bad idea
(and why)?

Thanks.
-- 
Have a nice day,
Timofey.


* Re: Is it possible to speed up unlink()?
  2016-10-20  9:29 Is it possible to speed up unlink()? Timofey Titovets
@ 2016-10-20 12:09 ` Austin S. Hemmelgarn
  2016-10-20 13:47   ` Timofey Titovets
  2016-10-20 15:26   ` Roman Mamedov
  0 siblings, 2 replies; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2016-10-20 12:09 UTC (permalink / raw)
  To: Timofey Titovets, linux-btrfs

On 2016-10-20 05:29, Timofey Titovets wrote:
> Hi, i use btrfs for NFS VM replica storage and for NFS shared VM storage.
> At now i have a small problem what VM image deletion took to long time
> and NFS client show a timeout on deletion
> (ESXi Storage migration as example).
>
> Kernel: Linux nfs05 4.7.0-0.bpo.1-amd64 #1 SMP Debian 4.7.5-1~bpo8+2
> (2016-10-01) x86_64 GNU/Linux
> Mount options: noatime,compress-force=zlib,space_cache,commit=180
> Feature enabled:
> big_metadata:1
> compress_lzo:1
> extended_iref:1
> mixed_backref:1
> no_holes:1
> skinny_metadata:1
>
> AFAIK, unlink() return only when all references to all extents from
> unlinked inode will be deleted
> So with compression enabled files have a many many refs to each
> compressed chunk.
> So, it's possible to return unlink() early? or this a bad idea(and why)?
I may be completely off about this, but I could have sworn that unlink() 
returns when enough info is on the disk that both:
1. The file isn't actually visible in the directory.
2. If the system crashes, the filesystem will know to finish the cleanup.
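
To see how long unlink() actually blocks on a given filesystem, it can be timed directly. This is only a minimal sketch, not something taken from the thread: the helper name `timed_unlink` and the scratch file are assumptions for illustration, and in practice the path would point at a large image on the btrfs mount in question.

```python
import os
import tempfile
import time


def timed_unlink(path):
    """Unlink path and return how long the call blocked, in seconds."""
    start = time.perf_counter()
    os.unlink(path)  # thin wrapper around the unlink() system call
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical 1 MiB scratch file; substitute a real VM image path to
    # reproduce the slow case described above.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"\0" * (1 << 20))
    os.close(fd)
    elapsed = timed_unlink(path)
    print(f"unlink() blocked for {elapsed:.6f} s")
    print(f"name still visible: {os.path.exists(path)}")
```

Running this on the NFS server itself, rather than on a client, should help separate filesystem latency from anything NFS adds.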

Out of curiosity, what are the mount options (and export options) for 
the NFS share?  I have a feeling that that's also contributing.  In 
particular, if you're on a reliable network, forcing UDP for mounting 
can significantly help performance, and if your server is reliable, you 
can set NFS to run asynchronously to make unlink() return almost 
immediately.

Now, on top of that, you should probably look at adding 'lazytime' to 
the mount options for BTRFS.  This will cause updates to file 
time-stamps (not just atime, but mtime also, it has no net effect on 
ctime though, because a ctime update means something else in the inode 
got updated) to be deferred up to 24 hours or until the next time the 
inode would be written out, which can significantly improve performance 
on BTRFS because of the write-amplification.  It's not hugely likely to 
improve performance for unlink(), but it should improve write 
performance some, which may help in general.
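
A quick way to check whether an option such as 'lazytime' is in effect is to look at the option field of the matching /proc/mounts line. The following is a rough sketch under assumptions: the function names are made up for illustration, and the device and mount point in the usage example are hypothetical.

```python
def parse_mount_options(options):
    """Split a comma-separated mount option string into a dict.

    Flag options map to True; key=value options map to the value string.
    """
    parsed = {}
    for item in options.split(","):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def has_option(mounts_line, option):
    """Check one /proc/mounts-style line: device, mountpoint, fstype, options."""
    fields = mounts_line.split()
    return len(fields) >= 4 and option in parse_mount_options(fields[3])


if __name__ == "__main__":
    # Option string taken from the original report.
    opts = parse_mount_options("noatime,compress-force=zlib,space_cache,commit=180")
    print(opts["commit"], "lazytime" in opts)
    # Hypothetical /proc/mounts line for illustration only.
    print(has_option("/dev/sdc1 /srv/nfs btrfs rw,noatime,lazytime 0 0", "lazytime"))
```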


* Re: Is it possible to speed up unlink()?
  2016-10-20 12:09 ` Austin S. Hemmelgarn
@ 2016-10-20 13:47   ` Timofey Titovets
  2016-10-20 14:44     ` Austin S. Hemmelgarn
  2016-10-20 15:26   ` Roman Mamedov
  1 sibling, 1 reply; 8+ messages in thread
From: Timofey Titovets @ 2016-10-20 13:47 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

2016-10-20 15:09 GMT+03:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2016-10-20 05:29, Timofey Titovets wrote:
>>
>> Hi, i use btrfs for NFS VM replica storage and for NFS shared VM storage.
>> At now i have a small problem what VM image deletion took to long time
>> and NFS client show a timeout on deletion
>> (ESXi Storage migration as example).
>>
>> Kernel: Linux nfs05 4.7.0-0.bpo.1-amd64 #1 SMP Debian 4.7.5-1~bpo8+2
>> (2016-10-01) x86_64 GNU/Linux
>> Mount options: noatime,compress-force=zlib,space_cache,commit=180
>> Feature enabled:
>> big_metadata:1
>> compress_lzo:1
>> extended_iref:1
>> mixed_backref:1
>> no_holes:1
>> skinny_metadata:1
>>
>> AFAIK, unlink() return only when all references to all extents from
>> unlinked inode will be deleted
>> So with compression enabled files have a many many refs to each
>> compressed chunk.
>> So, it's possible to return unlink() early? or this a bad idea(and why)?
>
> I may be completely off about this, but I could have sworn that unlink()
> returns when enough info is on the disk that both:
> 1. The file isn't actually visible in the directory.
> 2. If the system crashes, the filesystem will know to finish the cleanup.
>
> Out of curiosity, what are the mount options (and export options) for the
> NFS share?  I have a feeling that that's also contributing.  In particular,
> if you're on a reliable network, forcing UDP for mounting can significantly
> help performance, and if your server is reliable, you can set NFS to run
> asynchronously to make unlink() return almost immediately.


For the NFS export I use:
rw,no_root_squash,async,no_subtree_check,fsid=1
AFAIK ESXi doesn't support NFS over UDP.
And you are right: on a normal Linux client async works pretty well and
deletion of a big file is pretty fast (but it can also lock up nfsd on
the NFS server for a long time while it does the unlink()).

> Now, on top of that, you should probably look at adding 'lazytime' to the
> mount options for BTRFS.  This will cause updates to file time-stamps (not
> just atime, but mtime also, it has no net effect on ctime though, because a
> ctime update means something else in the inode got updated) to be deferred
> up to 24 hours or until the next time the inode would be written out, which
> can significantly improve performance on BTRFS because of the
> write-amplification.  It's not hugely likely to improve performance for
> unlink(), but it should improve write performance some, which may help in
> general.

Thanks for lazytime, I forgot about it %)
On my Debian servers I can't apply it; I get the error:
BTRFS info (device sdc1): unrecognized mount option 'lazytime'
But I applied it successfully on my Arch box (Linux 4.8.2).

For a fast unlink(), I was thinking of subvolume-like behaviour: it's
possible to delete a subvolume quickly (without a commit), and the
kernel then cleans up the data in the background.

-- 
Have a nice day,
Timofey.


* Re: Is it possible to speed up unlink()?
  2016-10-20 13:47   ` Timofey Titovets
@ 2016-10-20 14:44     ` Austin S. Hemmelgarn
  2016-10-20 17:33       ` ronnie sahlberg
  0 siblings, 1 reply; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2016-10-20 14:44 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: linux-btrfs

On 2016-10-20 09:47, Timofey Titovets wrote:
> 2016-10-20 15:09 GMT+03:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> On 2016-10-20 05:29, Timofey Titovets wrote:
>>>
>>> Hi, i use btrfs for NFS VM replica storage and for NFS shared VM storage.
>>> At now i have a small problem what VM image deletion took to long time
>>> and NFS client show a timeout on deletion
>>> (ESXi Storage migration as example).
>>>
>>> Kernel: Linux nfs05 4.7.0-0.bpo.1-amd64 #1 SMP Debian 4.7.5-1~bpo8+2
>>> (2016-10-01) x86_64 GNU/Linux
>>> Mount options: noatime,compress-force=zlib,space_cache,commit=180
>>> Feature enabled:
>>> big_metadata:1
>>> compress_lzo:1
>>> extended_iref:1
>>> mixed_backref:1
>>> no_holes:1
>>> skinny_metadata:1
>>>
>>> AFAIK, unlink() return only when all references to all extents from
>>> unlinked inode will be deleted
>>> So with compression enabled files have a many many refs to each
>>> compressed chunk.
>>> So, it's possible to return unlink() early? or this a bad idea(and why)?
>>
>> I may be completely off about this, but I could have sworn that unlink()
>> returns when enough info is on the disk that both:
>> 1. The file isn't actually visible in the directory.
>> 2. If the system crashes, the filesystem will know to finish the cleanup.
>>
>> Out of curiosity, what are the mount options (and export options) for the
>> NFS share?  I have a feeling that that's also contributing.  In particular,
>> if you're on a reliable network, forcing UDP for mounting can significantly
>> help performance, and if your server is reliable, you can set NFS to run
>> asynchronously to make unlink() return almost immediately.
>
>
> For NFS export i use:
> rw,no_root_squash,async,no_subtree_check,fsid=1
> AFAIK ESXi don't support nfs with udp
That doesn't surprise me.  If there's any chance of packet loss, then 
NFS over UDP risks data corruption, so a lot of 'professional' software 
only supports NFS over TCP.  The thing is though, in a vast majority of 
networks ESXi would be running in, there's functionally zero chance of 
packet loss unless there's a hardware failure.
> And you right on normal Linux client async work pretty good and
> deletion of big file are pretty fast (but also it's can lock nfsd on
> nfs server for long time, while he do unlink()).
You might also try with NFS-Ganesha instead of the Linux kernel NFS 
server.  It scales a whole lot better and tends to be a bit smarter, so 
it might help (especially since it gives better NFS over TCP performance 
than the kernel server too).  The only significant downside is that it's 
somewhat lacking in good documentation.
>
>> Now, on top of that, you should probably look at adding 'lazytime' to the
>> mount options for BTRFS.  This will cause updates to file time-stamps (not
>> just atime, but mtime also, it has no net effect on ctime though, because a
>> ctime update means something else in the inode got updated) to be deferred
>> up to 24 hours or until the next time the inode would be written out, which
>> can significantly improve performance on BTRFS because of the
>> write-amplification.  It's not hugely likely to improve performance for
>> unlink(), but it should improve write performance some, which may help in
>> general.
>
> Thanks for lazytime i forgot about it %)
> On my debian servers i can't apply it with error:
> BTRFS info (device sdc1): unrecognized mount option 'lazytime'
> But successful apply it to my arch box (Linux 4.8.2)
That's odd, 4.7 kernels definitely have support for it (I've been using 
it since 4.7.0 on all my systems, but I build upstream kernels).
>
> For fast unlink(), i just think about subvolume like behaviour, then
> it's possible to fast delete subvolume (without commit) and then
> kernel will clean data in the background.
There are two other possibilities I can think of to improve this.  One is
putting each VM image in its own subvolume, but that then means you
almost certainly can't use ESXi to delete the images directly, although
it will likely get you better performance overall.

The other is to see if you can use a chunked image file format.  I'm not
sure what it would be called in VMware, but it just amounts to splitting
the image into a number of smaller files (4M seems to work well for most
workloads).  This should also get you slightly better performance
(assuming you have things aligned to the chunk size in the VM disk
itself), and in my experience, it's generally faster on BTRFS to unlink
lots of small files than one big file.  I think that VMDK supports this
(it appears to in VirtualBox at least), but you may need to use a
command-line tool to create the image instead of doing it by hand.
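
The arithmetic behind a split image is simple enough to sketch. This is illustrative only: the 4 MiB chunk size comes from the suggestion above, while the function names and the file naming scheme are assumptions and do not describe how VMDK or any other real format lays out its pieces.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, the size suggested above


def chunk_count(image_size, chunk_size=CHUNK_SIZE):
    """Number of chunk files needed to hold image_size bytes."""
    return (image_size + chunk_size - 1) // chunk_size


def locate(offset, chunk_size=CHUNK_SIZE):
    """Map a byte offset in the virtual disk to (chunk index, offset in chunk)."""
    return divmod(offset, chunk_size)


def chunk_name(base, index):
    """Hypothetical naming scheme; real split formats use their own names."""
    return f"{base}.{index:06d}"


if __name__ == "__main__":
    size = 10 * 1024 ** 3  # a 10 GiB image, chosen only as an example
    print(chunk_count(size), locate(CHUNK_SIZE + 5), chunk_name("disk", 3))
```

Deleting such an image means one unlink() per chunk, so each individual call only has to drop the extents of a small file.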


* Re: Is it possible to speed up unlink()?
  2016-10-20 12:09 ` Austin S. Hemmelgarn
  2016-10-20 13:47   ` Timofey Titovets
@ 2016-10-20 15:26   ` Roman Mamedov
  2016-10-20 15:49     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 8+ messages in thread
From: Roman Mamedov @ 2016-10-20 15:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Timofey Titovets, linux-btrfs

On Thu, 20 Oct 2016 08:09:14 -0400
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> > So, it's possible to return unlink() early? or this a bad idea(and why)?
> I may be completely off about this, but I could have sworn that unlink() 
> returns when enough info is on the disk that both:
> 1. The file isn't actually visible in the directory.
> 2. If the system crashes, the filesystem will know to finish the cleanup.

As I understand it there is no fundamental reason why rm of a heavily
fragmented file couldn't be exactly as fast as deleting a subvolume with
only that single file in it. Remove the directory reference and instantly
return success to userspace, continuing to clean up extents in the background.

However for many uses that could be counter-productive, as scripts might
expect the disk space to be freed up completely after the rm command returns
(as they might need to start filling up the partition with new data). 

In snapshot deletion there are various commit modes built in for that purpose,
but I'm not sure if you can easily extend POSIX file deletion to implement
synchronous and non-synchronous deletion modes.
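
The "remove the name now, clean up later" idea can be approximated in userspace, which may help to make the trade-off concrete. This is a sketch under assumptions, not an existing btrfs or POSIX feature: `deferred_unlink` and the trash directory are hypothetical, and the trash directory has to live on the same filesystem (and subvolume) as the file so that the rename stays a cheap metadata operation.

```python
import os
import tempfile
import threading
import uuid


def deferred_unlink(path, trash_dir):
    """Hide path immediately, then unlink it from a background thread.

    Returns the worker thread so callers can join() it if they need the
    space to be released before continuing.
    """
    hidden = os.path.join(trash_dir, uuid.uuid4().hex)
    os.rename(path, hidden)  # the name disappears from its directory here
    worker = threading.Thread(target=os.unlink, args=(hidden,))
    worker.start()  # the slow extent cleanup is paid for in this thread
    return worker


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    trash = os.path.join(base, ".trash")
    os.mkdir(trash)
    image = os.path.join(base, "vm.img")
    open(image, "wb").close()
    worker = deferred_unlink(image, trash)
    print("visible after return:", os.path.exists(image))
    worker.join()
    print("left in trash:", os.listdir(trash))
```

Note that this has exactly the drawback described above: when the function returns, the disk space has not been freed yet.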

* Try the 'unlink' program instead of 'rm'; if "just remove the dir entry for
  now" was implemented anywhere, I'd expect it to be via that.
* Try doing 'eatmydata rm', but that's more of a crazy idea than anything else,
  as eatmydata only affects fsyncs, and I don't think rm is necessarily
  invoking those.

-- 
With respect,
Roman


* Re: Is it possible to speed up unlink()?
  2016-10-20 15:26   ` Roman Mamedov
@ 2016-10-20 15:49     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2016-10-20 15:49 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Timofey Titovets, linux-btrfs

On 2016-10-20 11:26, Roman Mamedov wrote:
> On Thu, 20 Oct 2016 08:09:14 -0400
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
>
>>> So, it's possible to return unlink() early? or this a bad idea(and why)?
>> I may be completely off about this, but I could have sworn that unlink()
>> returns when enough info is on the disk that both:
>> 1. The file isn't actually visible in the directory.
>> 2. If the system crashes, the filesystem will know to finish the cleanup.
>
> As I understand it there is no fundamental reason why rm of a heavily
> fragmented file couldn't be exactly as fast as deleting a subvolume with
> only that single file in it. Remove the directory reference and instantly
> return success to userspace, continuing to clean up extents in the background.
The tree cleanup is actually a bit easier for a subvolume since it's the
root of its own tree.  This in turn means that there is less that
actually needs to be written for a subvolume with a single file in it to 
be deleted than for the file by itself to be deleted, since the write 
doesn't propagate up quite as many trees.

The thing is though that since the NFS export is set to async mode, the 
unlink should return almost immediately anyway.

The other issue is that the type of file in question is a pathological case
for any COW filesystem, not just BTRFS, and this behavior is pretty well 
understood.  Once you get past about 8G for a VM image on BTRFS, you 
either need to be looking at real block storage (LVM or something 
similar with the image exported using something like iSCSI or NBD), make 
absolutely certain the file is pre-allocated and marked NOCOW, or use a 
split file format.
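
For the "pre-allocated and marked NOCOW" option, a rough sketch of what that could look like follows. Assumptions: Linux only, the ioctl numbers are the 64-bit values from <linux/fs.h>, and `create_nocow_image` is a made-up helper name. The flag is the same attribute that `chattr +C` sets, and on BTRFS it only takes effect on an empty file, which is why it is set before the space is allocated.

```python
import array
import fcntl
import os

# Values from <linux/fs.h> as seen on 64-bit Linux (an assumption of this sketch).
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # same attribute as `chattr +C`


def create_nocow_image(path, size):
    """Create an empty file, try to mark it NOCOW, then preallocate size bytes.

    size must be greater than zero. Returns True if the flag reads back as
    set, False if the filesystem refused or ignored it; the file is created
    and preallocated either way.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        flags = array.array("i", [0])
        try:
            fcntl.ioctl(fd, FS_IOC_GETFLAGS, flags, True)
            flags[0] |= FS_NOCOW_FL
            fcntl.ioctl(fd, FS_IOC_SETFLAGS, flags)
            # Read the flags back, since some filesystems ignore the bit.
            fcntl.ioctl(fd, FS_IOC_GETFLAGS, flags, True)
            nocow = bool(flags[0] & FS_NOCOW_FL)
        except OSError:
            nocow = False
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return nocow
```

Keep in mind that NOCOW files on BTRFS are neither compressed nor checksummed, so this does not combine with compress-force.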
>
> However for many uses that could be counter-productive, as scripts might
> expect the disk space to be freed up completely after the rm command returns
> (as they might need to start filling up the partition with new data).
'Might' is an understatement, scripts _do_ expect the disk space to free 
up immediately, and this has caused a number of issues with various 
tools on BTRFS.  It's also an issue because just about everything 
expects unlink() to be functionally synchronous (ie, unlink() shouldn't 
have an impact on other operations if it's already returned).
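
A script that cannot rely on the space being available right after a delete could poll for it instead. This is a minimal sketch, with the function name and the default timings chosen arbitrarily for illustration; the free-space figure is whatever the filesystem reports, which on BTRFS is an estimate.

```python
import shutil
import time


def wait_for_free_space(path, needed_bytes, timeout=30.0, interval=0.5):
    """Poll the filesystem holding path until needed_bytes are free.

    Returns True once enough space is reported, False if timeout (in
    seconds) expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if shutil.disk_usage(path).free >= needed_bytes:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Example: wait up to 30 s for 1 GiB to become available here.
    print(wait_for_free_space(".", 1024 ** 3))
```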
>
> In snapshot deletion there are various commit modes built in for that purpose,
> but I'm not sure if you can easily extend POSIX file deletion to implement
> synchronous and non-synchronous deletion modes.
There isn't.  In theory it could be implemented as a mount option, but
even that gets risky for the same reason that implementing it globally
is potentially problematic.
>
> * Try the 'unlink' program instead of 'rm'; if "just remove the dir entry for
>   now" was implemented anywhere, I'd expect it to be via that.
'rm' just puts a nice UI on the unlink() call, 'unlink' just calls it 
directly, so I severely doubt that it will have any impact.
> * Try doing 'eatmydata rm', but that's more of a crazy idea than anything else,
>   as eatmydata only affects fsyncs, and I don't think rm is necessarily
>   invoking those.
It isn't, so this almost certainly won't help.



* Re: Is it possible to speed up unlink()?
  2016-10-20 14:44     ` Austin S. Hemmelgarn
@ 2016-10-20 17:33       ` ronnie sahlberg
  2016-10-20 17:44         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 8+ messages in thread
From: ronnie sahlberg @ 2016-10-20 17:33 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Timofey Titovets, linux-btrfs

On Thu, Oct 20, 2016 at 7:44 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-10-20 09:47, Timofey Titovets wrote:
>>
>> 2016-10-20 15:09 GMT+03:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>
>>> On 2016-10-20 05:29, Timofey Titovets wrote:
>>>>
>>>>
>>>> Hi, i use btrfs for NFS VM replica storage and for NFS shared VM
>>>> storage.
>>>> At now i have a small problem what VM image deletion took to long time
>>>> and NFS client show a timeout on deletion
>>>> (ESXi Storage migration as example).
>>>>
>>>> Kernel: Linux nfs05 4.7.0-0.bpo.1-amd64 #1 SMP Debian 4.7.5-1~bpo8+2
>>>> (2016-10-01) x86_64 GNU/Linux
>>>> Mount options: noatime,compress-force=zlib,space_cache,commit=180
>>>> Feature enabled:
>>>> big_metadata:1
>>>> compress_lzo:1
>>>> extended_iref:1
>>>> mixed_backref:1
>>>> no_holes:1
>>>> skinny_metadata:1
>>>>
>>>> AFAIK, unlink() return only when all references to all extents from
>>>> unlinked inode will be deleted
>>>> So with compression enabled files have a many many refs to each
>>>> compressed chunk.
>>>> So, it's possible to return unlink() early? or this a bad idea(and why)?
>>>
>>>
>>> I may be completely off about this, but I could have sworn that unlink()
>>> returns when enough info is on the disk that both:
>>> 1. The file isn't actually visible in the directory.
>>> 2. If the system crashes, the filesystem will know to finish the cleanup.
>>>
>>> Out of curiosity, what are the mount options (and export options) for the
>>> NFS share?  I have a feeling that that's also contributing.  In
>>> particular,
>>> if you're on a reliable network, forcing UDP for mounting can
>>> significantly
>>> help performance, and if your server is reliable, you can set NFS to run
>>> asynchronously to make unlink() return almost immediately.
>>
>>
>>
>> For NFS export i use:
>> rw,no_root_squash,async,no_subtree_check,fsid=1
>> AFAIK ESXi don't support nfs with udp
>
> That doesn't surprise me.  If there's any chance of packet loss, then NFS
> over UDP risks data corruption, so a lot of 'professional' software only
> supports NFS over TCP.  The thing is though, in a vast majority of networks
> ESXi would be running in, there's functionally zero chance of packet loss
> unless there's a hardware failure.
>>
>> And you right on normal Linux client async work pretty good and
>> deletion of big file are pretty fast (but also it's can lock nfsd on
>> nfs server for long time, while he do unlink()).
>
> You might also try with NFS-Ganesha instead of the Linux kernel NFS server.
> It scales a whole lot better and tends to be a bit smarter, so it might help
> (especially since it gives better NFS over TCP performance than the kernel
> server too).  The only significant downside is that it's somewhat lacking in
> good documentation.

He is using NFS and removing a single file.
This involves only two packets being exchanged between client and server:
-> NFSv3 REMOVE request
and
<- NFSv3 REMOVE reply

These packets are both < 100 bytes in size.
On the server side, both knfsd.ko and Ganesha pretty much just
call unlink() for this request.

This looks like a pure BTRFS issue and I cannot see how knfsd vs
Ganesha or TCP vs UDP can help.
Traditional NFS clients allow tuning for impossibly slow servers,
for example using the 'timeo' client mount option.

Maybe ESXi has a similar option to make it more tolerant of "when the
server does not respond within a reasonable timeout, so we might need
to consider the server dead and return EIO to the application."




>>
>>
>>> Now, on top of that, you should probably look at adding 'lazytime' to the
>>> mount options for BTRFS.  This will cause updates to file time-stamps
>>> (not
>>> just atime, but mtime also, it has no net effect on ctime though, because
>>> a
>>> ctime update means something else in the inode got updated) to be
>>> deferred
>>> up to 24 hours or until the next time the inode would be written out,
>>> which
>>> can significantly improve performance on BTRFS because of the
>>> write-amplification.  It's not hugely likely to improve performance for
>>> unlink(), but it should improve write performance some, which may help in
>>> general.
>>
>>
>> Thanks for lazytime i forgot about it %)
>> On my debian servers i can't apply it with error:
>> BTRFS info (device sdc1): unrecognized mount option 'lazytime'
>> But successful apply it to my arch box (Linux 4.8.2)
>
> That's odd, 4.7 kernels definitely have support for it (I've been using it
> since 4.7.0 on all my systems, but I build upstream kernels).
>>
>>
>> For fast unlink(), i just think about subvolume like behaviour, then
>> it's possible to fast delete subvolume (without commit) and then
>> kernel will clean data in the background.
>
> There's two other possibilities I can think of to improve this.  One is
> putting each VM image in it's own subvolume, but that then means you almost
> certainly can't use ESXi to delete the images directly, although it will
> likely get you better performance overall.
>
> The other is to see if you can use a chunked image file format.  I'm not
> sure what it would be called in VMWare, but it just amounts to splitting the
> image into a number of smaller files (4M seems to work well for most
> workloads).  This should also get you slightly better performance (assuming
> you have things aligned to the chunk size in the VM disk itself), and In my
> experience, it's generally faster on BTRFS to unlink lots of small files
> than one big file.  I think that VMDK supports this (it appears to in
> VirtualBox at least), but you may need to use a command-line tool to create
> the image instead of doing it by hand.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Is it possible to speed up unlink()?
  2016-10-20 17:33       ` ronnie sahlberg
@ 2016-10-20 17:44         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2016-10-20 17:44 UTC (permalink / raw)
  To: ronnie sahlberg; +Cc: Timofey Titovets, linux-btrfs

On 2016-10-20 13:33, ronnie sahlberg wrote:
> On Thu, Oct 20, 2016 at 7:44 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-10-20 09:47, Timofey Titovets wrote:
>>>
>>> 2016-10-20 15:09 GMT+03:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>>
>>>> On 2016-10-20 05:29, Timofey Titovets wrote:
>>>>>
>>>>>
>>>>> Hi, i use btrfs for NFS VM replica storage and for NFS shared VM
>>>>> storage.
>>>>> At now i have a small problem what VM image deletion took to long time
>>>>> and NFS client show a timeout on deletion
>>>>> (ESXi Storage migration as example).
>>>>>
>>>>> Kernel: Linux nfs05 4.7.0-0.bpo.1-amd64 #1 SMP Debian 4.7.5-1~bpo8+2
>>>>> (2016-10-01) x86_64 GNU/Linux
>>>>> Mount options: noatime,compress-force=zlib,space_cache,commit=180
>>>>> Feature enabled:
>>>>> big_metadata:1
>>>>> compress_lzo:1
>>>>> extended_iref:1
>>>>> mixed_backref:1
>>>>> no_holes:1
>>>>> skinny_metadata:1
>>>>>
>>>>> AFAIK, unlink() return only when all references to all extents from
>>>>> unlinked inode will be deleted
>>>>> So with compression enabled files have a many many refs to each
>>>>> compressed chunk.
>>>>> So, it's possible to return unlink() early? or this a bad idea(and why)?
>>>>
>>>>
>>>> I may be completely off about this, but I could have sworn that unlink()
>>>> returns when enough info is on the disk that both:
>>>> 1. The file isn't actually visible in the directory.
>>>> 2. If the system crashes, the filesystem will know to finish the cleanup.
>>>>
>>>> Out of curiosity, what are the mount options (and export options) for the
>>>> NFS share?  I have a feeling that that's also contributing.  In
>>>> particular,
>>>> if you're on a reliable network, forcing UDP for mounting can
>>>> significantly
>>>> help performance, and if your server is reliable, you can set NFS to run
>>>> asynchronously to make unlink() return almost immediately.
>>>
>>>
>>>
>>> For NFS export i use:
>>> rw,no_root_squash,async,no_subtree_check,fsid=1
>>> AFAIK ESXi don't support nfs with udp
>>
>> That doesn't surprise me.  If there's any chance of packet loss, then NFS
>> over UDP risks data corruption, so a lot of 'professional' software only
>> supports NFS over TCP.  The thing is though, in a vast majority of networks
>> ESXi would be running in, there's functionally zero chance of packet loss
>> unless there's a hardware failure.
>>>
>>> And you right on normal Linux client async work pretty good and
>>> deletion of big file are pretty fast (but also it's can lock nfsd on
>>> nfs server for long time, while he do unlink()).
>>
>> You might also try with NFS-Ganesha instead of the Linux kernel NFS server.
>> It scales a whole lot better and tends to be a bit smarter, so it might help
>> (especially since it gives better NFS over TCP performance than the kernel
>> server too).  The only significant downside is that it's somewhat lacking in
>> good documentation.
>
> He is using NFS and removing a single file.
> This involves only two packets to be exchanged between client and server
> -> NFSv3 REMOVE resquest
> and
> <- NFSv3 REMOTE reply
>
> These packets are both < 100 bytes in size.
> On the server side, both knfsd.ko as well as Ganesha both pretty much just
> calls unlink() for this request.
>
> This looks like a pure BTRFS issue and I can not see how kngsd vs
> ganesha or tcp vs udp can help.
I never said I thought it was hugely likely to help, I just said it 
might.  The suggestion of trying Ganesha was more directed at the 
server-side lockup during the unlink operation, and I would be 
completely unsurprised if it handles this better in that respect than knfsd.

As far as TCP vs UDP though, you might be surprised.  Just by raw packet 
count, TCP doubles your overhead (because of the ACK packets), and 
because of the extra processing involved in the networking stack just 
from it being TCP, it can cause a pretty significant impact on NFS 
performance.

Most of the issue still is of course BTRFS, but as I mentioned in at 
least one other reply in this thread, it's an issue with BTRFS that's 
well documented for this usage (VM disk image storage, not NFS exports).
> Traditional nfs clients allow to tweak for impossibly slow servers,
> for example using the 'timeo' client mount
> option.
>
>
> Maybe ESXi has a similar option to make it more tolerant to "when the
> server does not respond within
> reasonable timeout so we might need to consider the server dead and
> return EIO to the application."
ESXi is proprietary 'enterprise' software, so I doubt it.
>
>
>
>
>>>
>>>
>>>> Now, on top of that, you should probably look at adding 'lazytime' to the
>>>> mount options for BTRFS.  This will cause updates to file time-stamps
>>>> (not
>>>> just atime, but mtime also, it has no net effect on ctime though, because
>>>> a
>>>> ctime update means something else in the inode got updated) to be
>>>> deferred
>>>> up to 24 hours or until the next time the inode would be written out,
>>>> which
>>>> can significantly improve performance on BTRFS because of the
>>>> write-amplification.  It's not hugely likely to improve performance for
>>>> unlink(), but it should improve write performance some, which may help in
>>>> general.
>>>
>>>
>>> Thanks for lazytime i forgot about it %)
>>> On my debian servers i can't apply it with error:
>>> BTRFS info (device sdc1): unrecognized mount option 'lazytime'
>>> But successful apply it to my arch box (Linux 4.8.2)
>>
>> That's odd, 4.7 kernels definitely have support for it (I've been using it
>> since 4.7.0 on all my systems, but I build upstream kernels).
>>>
>>>
>>> For fast unlink(), i just think about subvolume like behaviour, then
>>> it's possible to fast delete subvolume (without commit) and then
>>> kernel will clean data in the background.
>>
>> There's two other possibilities I can think of to improve this.  One is
>> putting each VM image in it's own subvolume, but that then means you almost
>> certainly can't use ESXi to delete the images directly, although it will
>> likely get you better performance overall.
>>
>> The other is to see if you can use a chunked image file format.  I'm not
>> sure what it would be called in VMWare, but it just amounts to splitting the
>> image into a number of smaller files (4M seems to work well for most
>> workloads).  This should also get you slightly better performance (assuming
>> you have things aligned to the chunk size in the VM disk itself), and In my
>> experience, it's generally faster on BTRFS to unlink lots of small files
>> than one big file.  I think that VMDK supports this (it appears to in
>> VirtualBox at least), but you may need to use a command-line tool to create
>> the image instead of doing it by hand.
>>
