* Announcing btrfs-dedupe
@ 2016-11-06 13:30 James Pharaoh
  2016-11-07 14:02 ` David Sterba
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: James Pharaoh @ 2016-11-06 13:30 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

I'm pleased to announce my btrfs deduplication utility, written in Rust. 
It operates on whole files, is fast, and I believe it complements the 
existing utilities (duperemove and bedup).

Please visit the homepage for more information:

http://btrfs-dedupe.com

James Pharaoh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-06 13:30 Announcing btrfs-dedupe James Pharaoh
@ 2016-11-07 14:02 ` David Sterba
  2016-11-07 17:48   ` Mark Fasheh
  2016-11-08  2:40   ` Christoph Anton Mitterer
  2016-11-07 17:59 ` Mark Fasheh
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 42+ messages in thread
From: David Sterba @ 2016-11-07 14:02 UTC (permalink / raw)
  To: James Pharaoh; +Cc: linux-btrfs, mark

On Sun, Nov 06, 2016 at 02:30:52PM +0100, James Pharaoh wrote:
> I'm pleased to announce my btrfs deduplication utility, written in Rust. 
> This operates on whole files, is fast, and I believe complements the 
> existing utilities (duperemove, bedup), which exist currently.

Mark can correct me if I'm wrong, but AFAIK, duperemove can consume
output of fdupes, which does the whole file scanning for duplicates. And
I think adding a whole-file dedup mode to duperemove would be better
(from user's POV) than writing a whole new tool, eg. because of existing
availability of duperemove in the distros.

Also, looking at your roadmap, some of the items are already implemented
in duperemove: a database of existing csums, crossing filesystem
boundaries, and mtime-based speedups.
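The fdupes-style whole-file pass mentioned above can be sketched in a few
lines. This is a purely illustrative Python sketch of the idea (bucket by
size, then confirm with a content hash), not the actual code of fdupes or
duperemove:

```python
import hashlib
import os
from collections import defaultdict

def find_whole_file_duplicates(paths):
    """Group files that are byte-for-byte identical.

    First bucket by size (cheap), then confirm with a SHA-256
    content hash. This mirrors the fdupes-style whole-file scan
    only in spirit; the real tools are more careful (partial
    hashes, byte-by-byte comparison, and so on).
    """
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have duplicates
        by_hash = defaultdict(list)
        for p in same_size:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(p)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Each returned group is a set of whole-file duplicates that could be fed to
the dedupe ioctl.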

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-07 14:02 ` David Sterba
@ 2016-11-07 17:48   ` Mark Fasheh
  2016-11-07 20:54     ` Adam Borowski
  2016-11-08  2:40   ` Christoph Anton Mitterer
  1 sibling, 1 reply; 42+ messages in thread
From: Mark Fasheh @ 2016-11-07 17:48 UTC (permalink / raw)
  To: dsterba, James Pharaoh, linux-btrfs, Mark Fasheh

Hi David and James,

On Mon, Nov 7, 2016 at 6:02 AM, David Sterba <dsterba@suse.cz> wrote:
> On Sun, Nov 06, 2016 at 02:30:52PM +0100, James Pharaoh wrote:
>> I'm pleased to announce my btrfs deduplication utility, written in Rust.
>> This operates on whole files, is fast, and I believe complements the
>> existing utilities (duperemove, bedup), which exist currently.
>
> Mark can correct me if I'm wrong, but AFAIK, duperemove can consume
> output of fdupes, which does the whole file scanning for duplicates. And
> I think adding a whole-file dedup mode to duperemove would be better
> (from user's POV) than writing a whole new tool, eg. because of existing
> availability of duperemove in the distros.

Yeah, you are correct - "fdupes -r /foo | duperemove --fdupes" will get
you the same effect.

There's been a request for us to do all of that internally so that the
whole file dedupe works with the mtime checking code. This is entirely
doable. I would probably either add a field to the files table or add
a new table to hold whole-file hashes. We can then squeeze down our
existing block hashes into one big one or just rehash the whole file.
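Squeezing the existing block hashes down into one whole-file hash could
look roughly like this; a purely illustrative sketch, not duperemove's
actual hashing code:

```python
import hashlib

def whole_file_hash_from_blocks(block_hashes):
    """Combine per-block digests into one whole-file digest by
    hashing the ordered concatenation of the block hashes.

    Two files with identical block-hash sequences get the same
    combined hash, so existing block checksums can be reused for a
    whole-file dedupe pass without rereading any file data.
    """
    h = hashlib.sha256()
    for bh in block_hashes:
        h.update(bh)
    return h.hexdigest()
```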


> Also looking to your roadmap, some of the items are implemented in
> duperemove: database of existing csums, cross filesystem boundary,
> mtime-based speedups.

Yeah, rescanning based on mtime was a huge speedup for Duperemove as
was keeping checksums in a db. We do all this today, also on XFS with
the dedupe ioctl (I believe this should be out with Linux-4.9).
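The mtime-based rescan speedup amounts to "only rehash files whose mtime
changed since the last run". As an illustrative sketch (a plain dict
stands in for the checksum database):

```python
import os

def files_needing_rescan(paths, seen_mtimes):
    """Return the subset of paths whose recorded mtime is missing
    or stale, i.e. the only files a rescan needs to rehash.

    `seen_mtimes` maps path -> mtime recorded on the previous run;
    a real tool would persist this in its hashfile database, here
    it is just a dict for illustration.
    """
    stale = []
    for p in paths:
        mtime = os.stat(p).st_mtime_ns
        if seen_mtimes.get(p) != mtime:
            stale.append(p)
            seen_mtimes[p] = mtime  # record for the next run
    return stale
```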

Btw, there are lots of little details and bug fixes which I feel add up
to a relatively complete (though far from perfect!) tool. For example,
the dedupe code can handle multiple kernel versions, including old
kernels which couldn't dedupe on non-aligned block boundaries. Every
major step in duperemove is threaded at this point too, which has also
been an enormous performance win (and one that new features benefit
from).

Thanks,
    --Mark

-- 
"When the going gets weird, the weird turn pro."
Hunter S. Thompson

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-06 13:30 Announcing btrfs-dedupe James Pharaoh
  2016-11-07 14:02 ` David Sterba
@ 2016-11-07 17:59 ` Mark Fasheh
  2016-11-07 18:49   ` James Pharaoh
  2016-11-08 11:06 ` Niccolò Belli
  2016-11-08 22:36 ` Saint Germain
  3 siblings, 1 reply; 42+ messages in thread
From: Mark Fasheh @ 2016-11-07 17:59 UTC (permalink / raw)
  To: James Pharaoh; +Cc: linux-btrfs

Hi James,

Re the following text on your project page:

"IMPORTANT CAVEAT — I have read that there are race and/or error
conditions which can cause filesystem corruption in the kernel
implementation of the deduplication ioctl."

Can you expound on that? I'm not aware of any bugs right now but if
there is any it'd absolutely be worth having that info on the btrfs
list.

Thanks,
    --Mark


On Sun, Nov 6, 2016 at 7:30 AM, James Pharaoh
<james@wellbehavedsoftware.com> wrote:
> Hi all,
>
> I'm pleased to announce my btrfs deduplication utility, written in Rust.
> This operates on whole files, is fast, and I believe complements the
> existing utilities (duperemove, bedup), which exist currently.
>
> Please visit the homepage for more information:
>
> http://btrfs-dedupe.com
>
> James Pharaoh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-07 17:59 ` Mark Fasheh
@ 2016-11-07 18:49   ` James Pharaoh
  2016-11-07 18:53     ` James Pharaoh
  2016-11-14 18:07     ` Zygo Blaxell
  0 siblings, 2 replies; 42+ messages in thread
From: James Pharaoh @ 2016-11-07 18:49 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs

Annoyingly I can't find this now, but I definitely remember reading a 
claim, apparently from someone knowledgeable, that the latest kernel 
version I was using at the time still suffered from issues in the dedupe 
code.

This was a while ago, and I would be very pleased to hear that there is 
high confidence in the current implementation! I'll post a link if I 
manage to find the comments.

James

On 07/11/16 18:59, Mark Fasheh wrote:
> Hi James,
>
> Re the following text on your project page:
>
> "IMPORTANT CAVEAT — I have read that there are race and/or error
> conditions which can cause filesystem corruption in the kernel
> implementation of the deduplication ioctl."
>
> Can you expound on that? I'm not aware of any bugs right now but if
> there is any it'd absolutely be worth having that info on the btrfs
> list.
>
> Thanks,
>     --Mark
>
>
> On Sun, Nov 6, 2016 at 7:30 AM, James Pharaoh
> <james@wellbehavedsoftware.com> wrote:
>> Hi all,
>>
>> I'm pleased to announce my btrfs deduplication utility, written in Rust.
>> This operates on whole files, is fast, and I believe complements the
>> existing utilities (duperemove, bedup), which exist currently.
>>
>> Please visit the homepage for more information:
>>
>> http://btrfs-dedupe.com
>>
>> James Pharaoh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-07 18:49   ` James Pharaoh
@ 2016-11-07 18:53     ` James Pharaoh
  2016-11-14 18:07     ` Zygo Blaxell
  1 sibling, 0 replies; 42+ messages in thread
From: James Pharaoh @ 2016-11-07 18:53 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs

FWIW I have updated my comments about duperemove and also the "caveat" 
section you mentioned in your other mail in the readme.

http://btrfs-dedupe.com

James

On 07/11/16 19:49, James Pharaoh wrote:
> Annoyingly I can't find this now, but I definitely remember reading
> someone, apparently someone knowledgeable, claim that the latest version
> of the kernel which I was using at the time, still suffered from issues
> regarding the dedupe code.
>
> This was a while ago, and I would be very pleased to hear that there is
> high confidence in the current implementation! I'll post a link if I
> manage to find the comments.
>
> James
>
> On 07/11/16 18:59, Mark Fasheh wrote:
>> Hi James,
>>
>> Re the following text on your project page:
>>
>> "IMPORTANT CAVEAT — I have read that there are race and/or error
>> conditions which can cause filesystem corruption in the kernel
>> implementation of the deduplication ioctl."
>>
>> Can you expound on that? I'm not aware of any bugs right now but if
>> there is any it'd absolutely be worth having that info on the btrfs
>> list.
>>
>> Thanks,
>>     --Mark
>>
>>
>> On Sun, Nov 6, 2016 at 7:30 AM, James Pharaoh
>> <james@wellbehavedsoftware.com> wrote:
>>> Hi all,
>>>
>>> I'm pleased to announce my btrfs deduplication utility, written in Rust.
>>> This operates on whole files, is fast, and I believe complements the
>>> existing utilities (duperemove, bedup), which exist currently.
>>>
>>> Please visit the homepage for more information:
>>>
>>> http://btrfs-dedupe.com
>>>
>>> James Pharaoh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-07 17:48   ` Mark Fasheh
@ 2016-11-07 20:54     ` Adam Borowski
  2016-11-08  2:17       ` Darrick J. Wong
  2016-11-09 15:02       ` David Sterba
  0 siblings, 2 replies; 42+ messages in thread
From: Adam Borowski @ 2016-11-07 20:54 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: dsterba, James Pharaoh, linux-btrfs

On Mon, Nov 07, 2016 at 09:48:41AM -0800, Mark Fasheh wrote:
> also on XFS with the dedupe ioctl (I believe this should be out with
> Linux-4.9).

It's already there in 4.9-rc1, although you need a special version of
xfsprogs (possibly already released, I didn't check).  It's an experimental
feature that needs to be enabled with "-m reflink=1".

Despite that experimental status, I'd strongly recommend James to test his
tool on xfs as well, as it's the second major implementation of this API[1].


Mark has already included XFS in documentation of duperemove, all that looks
amiss is btrfs-extent-same having an obsolete name.  But then, I never did
any non-superficial tests on XFS, beyond "seems to work".


Meow!

[1]. For some reason the zfs-on-linux guys haven't implemented this yet,
despite it being an obvious thing on ZFS.
-- 
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month.  Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-07 20:54     ` Adam Borowski
@ 2016-11-08  2:17       ` Darrick J. Wong
  2016-11-08 18:59         ` Mark Fasheh
  2016-11-09 15:02       ` David Sterba
  1 sibling, 1 reply; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-08  2:17 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Mark Fasheh, dsterba, James Pharaoh, linux-btrfs

On Mon, Nov 07, 2016 at 09:54:09PM +0100, Adam Borowski wrote:
> On Mon, Nov 07, 2016 at 09:48:41AM -0800, Mark Fasheh wrote:
> > also on XFS with the dedupe ioctl (I believe this should be out with
> > Linux-4.9).
> 
> It's already there in 4.9-rc1, although you need a special version of
> xfsprogs (possibly already released, I didn't check).  It's an experimental
> feature that needs to be enabled with "-m reflink=1".

The code will be available in xfsprogs 4.9, due out after Linux 4.9.

You'll still have to pass '-m reflink=1' to enable reflink until we
declare the feature stable, however.

> Despite that experimental status, I'd strongly recommend James to test his
> tool on xfs as well, as it's the second major implementation of this API[1].

Agreed. :)

> Mark has already included XFS in documentation of duperemove, all that looks
> amiss is btrfs-extent-same having an obsolete name.  But then, I never did
> any non-superficial tests on XFS, beyond "seems to work".

/me wonders if ocfs2 will ever catch up to the reflink/dedupe party. ;)

--Darrick

> 
> 
> Meow!
> 
> [1]. For some reason the zfs-on-linux guys haven't implemented this yet,
> despite it being an obvious thing on ZFS.
> -- 
> A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
> raspberries, 0.4kg sugar; put into a big jar for 1 month.  Filter out and
> throw away the fruits (can dump them into a cake, etc), let the drink age
> at least 3-6 months.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-07 14:02 ` David Sterba
  2016-11-07 17:48   ` Mark Fasheh
@ 2016-11-08  2:40   ` Christoph Anton Mitterer
  2016-11-08  6:11     ` James Pharaoh
                       ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Christoph Anton Mitterer @ 2016-11-08  2:40 UTC (permalink / raw)
  To: dsterba, James Pharaoh; +Cc: linux-btrfs, mark

On Mon, 2016-11-07 at 15:02 +0100, David Sterba wrote:
> I think adding a whole-file dedup mode to duperemove would be better
> (from user's POV) than writing a whole new tool

What would IMO be really good from a user's POV is if one of the
tools, deemed to be the "best", were added to btrfs-progs and
simply became "the official" one.

Cheers,
Chris.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08  2:40   ` Christoph Anton Mitterer
@ 2016-11-08  6:11     ` James Pharaoh
  2016-11-08 13:26     ` Austin S. Hemmelgarn
  2016-11-08 18:49     ` Mark Fasheh
  2 siblings, 0 replies; 42+ messages in thread
From: James Pharaoh @ 2016-11-08  6:11 UTC (permalink / raw)
  To: Christoph Anton Mitterer, dsterba; +Cc: linux-btrfs, mark

Perhaps the complexity of doing this efficiently makes it inappropriate 
for inclusion in the official tooling itself; the core implementation's 
focus, I believe, is on in-band deduplication, automatic and behind the 
scenes.

On 08/11/16 03:40, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-07 at 15:02 +0100, David Sterba wrote:
>> I think adding a whole-file dedup mode to duperemove would be better
>> (from user's POV) than writing a whole new tool
>
> What would IMO be really good from a user's POV was, if one of the
> tools, deemed to be the "best", would be added to the btrfs-progs and
> simply become "the official" one.
>
> Cheers,
> Chris.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-06 13:30 Announcing btrfs-dedupe James Pharaoh
  2016-11-07 14:02 ` David Sterba
  2016-11-07 17:59 ` Mark Fasheh
@ 2016-11-08 11:06 ` Niccolò Belli
  2016-11-08 11:38   ` James Pharaoh
  2016-11-14 18:27   ` Zygo Blaxell
  2016-11-08 22:36 ` Saint Germain
  3 siblings, 2 replies; 42+ messages in thread
From: Niccolò Belli @ 2016-11-08 11:06 UTC (permalink / raw)
  To: James Pharaoh; +Cc: linux-btrfs

Nice, you should probably update the btrfs wiki as well, because there is 
no mention of btrfs-dedupe there.

First question: why this name? Don't you plan to support XFS as well?

Second question: I'm trying deduplication tools for the very first time and 
I still have to figure out how to handle snapper snapshots, which are read 
only. I tried duperemove 0.11 (git) and I get tons of "Error 30: 
Read-only file system while opening "/.../@snapshots/4385/...". How am I 
supposed to handle snapper snapshots?

I do not run duperemove from a live distro, instead I run it directly on 
the system I want to deduplicate:

sudo mount -o noatime,compress=lzo,autodefrag /dev/mapper/cryptroot \
    /home/niko/nosnap/rootfs/
sudo duperemove -drh --dedupe-options=nofiemap \
    --hashfile=/home/niko/nosnap/rootfs.hash /home/niko/nosnap/rootfs/

Is btrfs-dedupe able to handle snapper snapshots?

Thanks,
Niccolo' Belli

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 11:06 ` Niccolò Belli
@ 2016-11-08 11:38   ` James Pharaoh
  2016-11-08 16:57     ` Niccolò Belli
  2016-11-14 18:27   ` Zygo Blaxell
  1 sibling, 1 reply; 42+ messages in thread
From: James Pharaoh @ 2016-11-08 11:38 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: linux-btrfs

On 08/11/16 12:06, Niccolò Belli wrote:
> Nice, you should probably update the btrfs wiki as well, because there
> is no mention of btrfs-dedupe.

I am planning to; I had to apply for an account, which has now been 
approved.

> First question, why this name? Don't you plan to support xfs as well?

It didn't occur to me, to be honest. I might support XFS as well, but I 
don't use it myself, and I will possibly be adding other btrfs-specific 
features to the tool. You'll notice it's part of a bigger wbs-backup repo, 
with other tools, which I'm developing to manage my storage and backup 
requirements.

I'll take a look at it, and certainly see if it works out of the box.

> Second question, I'm trying deduplication tools for the very first time
> and I still have to figure out how to handle snapper snapshots, which
> are read only. I currently tried duperemove 0.11 git and I get tons of
> "Error 30: Read-only file system while opening
> "/.../@snapshots/4385/...". How am I supposed to handle snapper snapshots?

 > Is btrfs-dedupe able to handle snapper snapshots?

You can't deduplicate a read-only snapshot, but you can create 
read-write snapshots from them, deduplicate those, and then recreate the 
read-only ones. This is what I've done.

In theory, once this has been done, it shouldn't have to be done again, 
at least for those snapshots, unless you want to modify the 
deduplication. It's probably a good idea to defragment files and 
directories first, as well.

It should be possible to deduplicate a read-only file to a read-write 
one, but that's probably not worth the effort in many real-world use cases.
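That create-rw-snapshot / dedupe / recreate-ro workflow can be scripted.
Here is a dry-run sketch that only *generates* the btrfs commands (the
paths and the duperemove invocation are placeholders), so the plan can be
inspected before anything touches the filesystem:

```python
def snapshot_dedupe_commands(snapshots, dedupe_cmd="duperemove -dr"):
    """Build the shell command sequence for deduplicating read-only
    snapshots via temporary read-write copies.

    For each read-only snapshot: take a writable snapshot of it,
    deduplicate the writable copy, then replace the old read-only
    snapshot with a read-only snapshot of the deduplicated copy.
    Returns the commands as strings instead of running them.
    """
    plan = []
    for snap in snapshots:
        rw = snap + ".rw"
        # Make a writable snapshot of the read-only one.
        plan.append(f"btrfs subvolume snapshot {snap} {rw}")
        # Deduplicate the writable copy.
        plan.append(f"{dedupe_cmd} {rw}")
        # Recreate the read-only snapshot from the deduped copy.
        plan.append(f"btrfs subvolume delete {snap}")
        plan.append(f"btrfs subvolume snapshot -r {rw} {snap}")
        plan.append(f"btrfs subvolume delete {rw}")
    return plan
```

Reviewing the generated plan (and only then piping it to a shell) keeps an
inherently destructive sequence auditable.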

James

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08  2:40   ` Christoph Anton Mitterer
  2016-11-08  6:11     ` James Pharaoh
@ 2016-11-08 13:26     ` Austin S. Hemmelgarn
  2016-11-08 16:57       ` Darrick J. Wong
  2016-11-08 18:49     ` Mark Fasheh
  2 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2016-11-08 13:26 UTC (permalink / raw)
  To: Christoph Anton Mitterer, dsterba, James Pharaoh; +Cc: linux-btrfs, mark

On 2016-11-07 21:40, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-07 at 15:02 +0100, David Sterba wrote:
>> I think adding a whole-file dedup mode to duperemove would be better
>> (from user's POV) than writing a whole new tool
>
> What would IMO be really good from a user's POV was, if one of the
> tools, deemed to be the "best", would be added to the btrfs-progs and
> simply become "the official" one.

The problem is that for deduplication, most tools won't work well for 
everything.  The cases I use it for are very specific and have horrible 
performance with pretty much any available tool.  I have a couple of 
cases where I have disjoint subsets of the same directory tree with 
different prefixes, so I can tell exactly which files are duplicated, 
and know that any duplicate file is 100% duplicate.  I also have a 
couple of cases where changes are small, scattered, and highly 
predictable, so it's easier to find what's changed and dedupe everything 
else than to find what's the same.  None of the existing options do well 
in either situation.

I'd argue at minimum for having the extent-same tool from duperemove in 
btrfs-progs, as that lets people do deduplication how they want without 
having to write C code.  Something equivalent that would let you call 
any BTRFS ioctl with (reasonably) arbitrary arguments might actually be 
even better (I can see such a tool being wonderful for debugging).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 13:26     ` Austin S. Hemmelgarn
@ 2016-11-08 16:57       ` Darrick J. Wong
  2016-11-08 17:04         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-08 16:57 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Christoph Anton Mitterer, dsterba, James Pharaoh, linux-btrfs, mark

On Tue, Nov 08, 2016 at 08:26:02AM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-07 21:40, Christoph Anton Mitterer wrote:
> >On Mon, 2016-11-07 at 15:02 +0100, David Sterba wrote:
> >>I think adding a whole-file dedup mode to duperemove would be better
> >>(from user's POV) than writing a whole new tool
> >
> >What would IMO be really good from a user's POV was, if one of the
> >tools, deemed to be the "best", would be added to the btrfs-progs and
> >simply become "the official" one.
> 
> The problem is that for deduplication, most tools won't work well for
> everything.  For example the cases I use it in are very specific and have
> horrible performance using pretty much any available tool (I have a couple
> cases where I have disjoint subsets of the same directory tree with
> different prefixes, so I can tell exactly which files are duplicated, and
> that any duplicate file is 100% duplicate, as well as a couple of cases
> where changes are small, scattered, and highly predictable (and thus it's
> easier to find what's changed and dedupe everything else instead of finding
> what's the same), and none of the existing options do well in either
> situation).
> 
> I'd argue at minimum for having the extent-same tool from duperemove in
> btrfs-progs, as that lets people do deduplication how they want without
> having to write C code.  Something equivalent that would let you call any
> BTRFS ioctl with (reasonably) arbitrary arguments might actually be even
> better (I can see such a tool being wonderful for debugging).

Since xfsprogs 4.3, xfs_io has a 'dedupe' command that can talk to
FIDEDUPERANGE (f.k.a. EXTENT SAME):

$ xfs_io -c 'dedupe /mnt/srcfile srcoffset dstoffset length' /mnt/destfile
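The same ioctl can also be driven without xfs_io. As a sketch, the
following packs struct file_dedupe_range plus one file_dedupe_range_info
(layout as in linux/fs.h) from Python; actually calling it of course
requires a filesystem that supports FIDEDUPERANGE, such as btrfs or XFS
with reflink enabled:

```python
import fcntl
import struct

# FIDEDUPERANGE = _IOWR(0x94, 54, struct file_dedupe_range);
# the header struct is 24 bytes, giving this request number:
FIDEDUPERANGE = 0xC0189436

def dedupe_range(src_fd, src_offset, length, dest_fd, dest_offset):
    """Ask the kernel to dedupe one range into one destination.

    Packs struct file_dedupe_range followed by a single
    file_dedupe_range_info and issues the ioctl on the source fd.
    Returns (bytes_deduped, status): status 0 means the ranges
    matched and were deduped, 1 means they differed, negative is
    an errno.
    """
    # __u64 src_offset; __u64 src_length; __u16 dest_count;
    # __u16 reserved1; __u32 reserved2;
    head = struct.pack("=QQHHI", src_offset, length, 1, 0, 0)
    # __s64 dest_fd; __u64 dest_offset; __u64 bytes_deduped (out);
    # __s32 status (out); __u32 reserved;
    info = struct.pack("=qQQiI", dest_fd, dest_offset, 0, 0, 0)
    buf = bytearray(head + info)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
    bytes_deduped, status = struct.unpack_from("=Qi", buf, len(head) + 16)
    return bytes_deduped, status
```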

--D


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 11:38   ` James Pharaoh
@ 2016-11-08 16:57     ` Niccolò Belli
  2016-11-08 16:58       ` James Pharaoh
  0 siblings, 1 reply; 42+ messages in thread
From: Niccolò Belli @ 2016-11-08 16:57 UTC (permalink / raw)
  To: James Pharaoh; +Cc: linux-btrfs

On Tuesday 8 November 2016 12:38:48 CET, James Pharaoh wrote:
> You can't deduplicate a read-only snapshot, but you can create 
> read-write snapshots from them, deduplicate those, and then 
> recreate the read-only ones. This is what I've done.

Since snapper creates hundreds of snapshots, isn't this something that the 
deduplication software could do for me if I explicitly tell it to do so? I 
mean momentarily switching the snapshot to rw in order to deduplicate it, 
then switching it back to ro.

> In theory, once this has been done once, it shouldn't have to 
> be done again, at least for those snapshots, unless you want to 
> modify the deduplication. It's probably a good idea to 
> defragment files and directories first, as well.

I can't defragment anything, because it would take too much free space to 
do so with so many snapshots. Instead, the deduplication software could 
defragment each file before calling the extent-same ioctl; that would be 
feasible. That way you would not need ridiculous amounts of free space to 
defragment the fs.

> It should be possible to deduplicate a read-only file to a 
> read-write one, but that's probably not worth the effort in many 
> real-world use cases.

This is exactly what I would expect a deduplication tool to do when it 
encounters a ro snapshot, except when I explicitly tell it to momentarily 
switch the snapshot to rw in order to deduplicate it.

Niccolo' Belli

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 16:57     ` Niccolò Belli
@ 2016-11-08 16:58       ` James Pharaoh
  2016-11-08 17:08         ` Niccolò Belli
  0 siblings, 1 reply; 42+ messages in thread
From: James Pharaoh @ 2016-11-08 16:58 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: linux-btrfs

Yes, everything you have described here is something I intend to create, 
and might as well include in the tool itself. I'll add it to the roadmap ;-)

James

On 08/11/16 17:57, Niccolò Belli wrote:
> On Tuesday 8 November 2016 12:38:48 CET, James Pharaoh wrote:
>> You can't deduplicate a read-only snapshot, but you can create
>> read-write snapshots from them, deduplicate those, and then recreate
>> the read-only ones. This is what I've done.
>
> Since snapper creates hundreds of snapshots, isn't this something that
> the deduplication software could do for me if I explicitly tell it to
> do so? I mean momentarily switching the snapshot to rw in order to
> deduplicate it, then switching it back to ro.
>
>> In theory, once this has been done once, it shouldn't have to be done
>> again, at least for those snapshots, unless you want to modify the
>> deduplication. It's probably a good idea to defragment files and
>> directories first, as well.
>
> I can't defragment anything, because it would take too much free space
> to do so with so many snapshots. Instead, the deduplication software
> could defragment each file before calling the extent-same ioctl, that
> would be feasible. Such a way you will not need hilarious amounts of
> free space to defragment the fs.
>
>> It should be possible to deduplicate a read-only file to a read-write
>> one, but that's probably not worth the effort in many real-world use
>> cases.
>
> This is exactly what I would expect a deduplication tool to do when it
> encounters a ro snapshot, except when I explicitly tell it to
> momentarily switch the snapshot to rw in order to deduplicate it.
>
> Niccolo' Belli

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 16:57       ` Darrick J. Wong
@ 2016-11-08 17:04         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2016-11-08 17:04 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Anton Mitterer, dsterba, James Pharaoh, linux-btrfs, mark

On 2016-11-08 11:57, Darrick J. Wong wrote:
> On Tue, Nov 08, 2016 at 08:26:02AM -0500, Austin S. Hemmelgarn wrote:
>> On 2016-11-07 21:40, Christoph Anton Mitterer wrote:
>>> On Mon, 2016-11-07 at 15:02 +0100, David Sterba wrote:
>>>> I think adding a whole-file dedup mode to duperemove would be better
>>>> (from user's POV) than writing a whole new tool
>>>
>>> What would IMO be really good from a user's POV was, if one of the
>>> tools, deemed to be the "best", would be added to the btrfs-progs and
>>> simply become "the official" one.
>>
>> The problem is that for deduplication, most tools won't work well for
>> everything.  For example the cases I use it in are very specific and have
>> horrible performance using pretty much any available tool (I have a couple
>> cases where I have disjoint subsets of the same directory tree with
>> different prefixes, so I can tell exactly which files are duplicated, and
>> that any duplicate file is 100% duplicate, as well as a couple of cases
>> where changes are small, scattered, and highly predictable (and thus it's
>> easier to find what's changed and dedupe everything else instead of finding
>> what's the same), and none of the existing options do well in either
>> situation).
>>
>> I'd argue at minimum for having the extent-same tool from duperemove in
>> btrfs-progs, as that lets people do deduplication how they want without
>> having to write C code.  Something equivalent that would let you call any
>> BTRFS ioctl with (reasonably) arbitrary arguments might actually be even
>> better (I can see such a tool being wonderful for debugging).
>
> Since xfsprogs 4.3, xfs_io has a 'dedupe' command that can talk to
> FIDEDUPERANGE (f.k.a. EXTENT SAME):
>
> $ xfs_io -c 'dedupe /mnt/srcfile srcoffset dstoffset length' /mnt/destfile
>
I actually hadn't known about this, thanks.  It means that xfs_io just 
got even more useful despite me not running XFS.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 16:58       ` James Pharaoh
@ 2016-11-08 17:08         ` Niccolò Belli
  0 siblings, 0 replies; 42+ messages in thread
From: Niccolò Belli @ 2016-11-08 17:08 UTC (permalink / raw)
  To: James Pharaoh; +Cc: linux-btrfs

On Tuesday 8 November 2016 17:58:52 CET, James Pharaoh wrote:
> Yes, everything you have described here is something I intend 
> to create, and might as well include in the tool itself. I'll 
> add it to the roadmap ;-)

Sounds good, but I have yet another feature request which is even more 
interesting in my opinion.
If you have ever used snapper you have probably found yourself in the 
position where you want to free some space and actually can't, because 
the files you want to delete are already present in countless snapshots. 
You then have to delete the unwanted files from every snapshot, which is 
a tedious task, even more difficult if you have moved/renamed those 
files. What I actually do is exploit duperemove's hashfile to grep for 
the checksum and obtain all the paths. Then I switch the snapshots to 
rw, manually delete each file, and finally switch them back to ro. A 
tool which automates these tasks would be awesome.
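The grep-the-hashfile trick is essentially an inverted index from checksum
to paths. An illustrative sketch of the idea (an in-memory dict stands in
for duperemove's hashfile, which is really a database file):

```python
from collections import defaultdict

def build_checksum_index(file_checksums):
    """Invert a path -> checksum mapping into checksum -> [paths]."""
    index = defaultdict(list)
    for path, csum in file_checksums.items():
        index[csum].append(path)
    return index

def copies_of(path, file_checksums):
    """Return every path sharing `path`'s checksum, i.e. every
    snapshot copy that would have to be deleted (after flipping
    its snapshot to rw) to actually free the space."""
    index = build_checksum_index(file_checksums)
    return sorted(index[file_checksums[path]])
```

The returned list survives renames and moves, since it keys on content
rather than on path.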

Niccolo'

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08  2:40   ` Christoph Anton Mitterer
  2016-11-08  6:11     ` James Pharaoh
  2016-11-08 13:26     ` Austin S. Hemmelgarn
@ 2016-11-08 18:49     ` Mark Fasheh
  2 siblings, 0 replies; 42+ messages in thread
From: Mark Fasheh @ 2016-11-08 18:49 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: dsterba, James Pharaoh, linux-btrfs

On Mon, Nov 7, 2016 at 6:40 PM, Christoph Anton Mitterer
<calestyo@scientia.net> wrote:
> On Mon, 2016-11-07 at 15:02 +0100, David Sterba wrote:
>> I think adding a whole-file dedup mode to duperemove would be better
>> (from user's POV) than writing a whole new tool
>
> What would IMO be really good from a user's POV was, if one of the
> tools, deemed to be the "best", would be added to the btrfs-progs and
> simply become "the official" one.

Yeah, there are two problems. One is that the extent-same ioctl (and
duperemove) is cross-filesystem now. The other one James touches on:
there's a non-trivial amount of complexity in duperemove, so shoving it
into btrfs-progs just means we're going to have parallel development
streams solving somewhat different problems.

That's not to say that every dedupe tool has to be complex - we have
xfs_io to run the ioctl and I don't think it'd be a bad idea if
btrfs-progs had a simple interface to it too.
   --Mark



-- 
"When the going gets weird, the weird turn pro."
Hunter S. Thompson

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08  2:17       ` Darrick J. Wong
@ 2016-11-08 18:59         ` Mark Fasheh
  2016-11-08 19:47             ` [Ocfs2-devel] " Darrick J. Wong
  0 siblings, 1 reply; 42+ messages in thread
From: Mark Fasheh @ 2016-11-08 18:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Adam Borowski, dsterba, James Pharaoh, linux-btrfs

On Mon, Nov 7, 2016 at 6:17 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Mon, Nov 07, 2016 at 09:54:09PM +0100, Adam Borowski wrote:
>> Mark has already included XFS in documentation of duperemove, all that looks
>> amiss is btrfs-extent-same having an obsolete name.  But then, I never did
>> any non-superficial tests on XFS, beyond "seems to work".

I'd actually be ok dropping btrfs-extent-same completely at this point
but I'm concerned that it would leave some users behind.


> /me wonders if ocfs2 will ever catch up to the reflink/dedupe party. ;)

Hey, Ocfs2 started the reflink party! But yeah it's fallen behind
since then with respect to cow and dedupe. More importantly though I'd
like to see some extra extent tracking in there like XFS did with the
reflink b+tree.
   --Mark

-- 
"When the going gets weird, the weird turn pro."
Hunter S. Thompson

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 18:59         ` Mark Fasheh
@ 2016-11-08 19:47             ` Darrick J. Wong
  0 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-08 19:47 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Adam Borowski, dsterba, James Pharaoh, linux-btrfs, ocfs2-devel

On Tue, Nov 08, 2016 at 10:59:56AM -0800, Mark Fasheh wrote:
> On Mon, Nov 7, 2016 at 6:17 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Mon, Nov 07, 2016 at 09:54:09PM +0100, Adam Borowski wrote:
> >> Mark has already included XFS in documentation of duperemove, all that looks
> >> amiss is btrfs-extent-same having an obsolete name.  But then, I never did
> >> any non-superficial tests on XFS, beyond "seems to work".
> 
> I'd actually be ok dropping btrfs-extent-same completely at this point
> but I'm concerned that it would leave some users behind.
> 
> 
> > /me wonders if ocfs2 will ever catch up to the reflink/dedupe party. ;)
> 
> Hey, Ocfs2 started the reflink party! But yeah it's fallen behind
> since then with respect to cow and dedupe. More importantly though I'd
> like to see some extra extent tracking in there like XFS did with the
> reflink b+tree.

Perhaps this should move to the ocfs2 list, but...

...as I understand ocfs2, each inode can point to the head of a refcount
tree that maintains refcounts for all the physical blocks that are
mapped by any of the files that share that refcount tree.  It wouldn't
be difficult to hook up this existing refcount structure to the reflink
and dedupe vfs ioctls, with the huge caveat that both inodes will end up
belonging to the same refcount tree (or the call fails).  This might not
be such a huge issue for reflink since we're generally only using it
during a file copy anyway, but for dedupe this could have disastrous
consequences if someone does an fs-wide dedupe and every file in the fs
ends up with the same refcount tree.

So I guess you could give each block group its own refcount tree or
something so that all the writes in the fs don't end up contending for a
single data structure.

--D

>    --Mark
> 
> -- 
> "When the going gets weird, the weird turn pro."
> Hunter S. Thompson
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-06 13:30 Announcing btrfs-dedupe James Pharaoh
                   ` (2 preceding siblings ...)
  2016-11-08 11:06 ` Niccolò Belli
@ 2016-11-08 22:36 ` Saint Germain
  2016-11-09 11:24   ` Niccolò Belli
  2016-11-13 12:45   ` James Pharaoh
  3 siblings, 2 replies; 42+ messages in thread
From: Saint Germain @ 2016-11-08 22:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: James Pharaoh

On Sun, 6 Nov 2016 14:30:52 +0100, James Pharaoh
<james@wellbehavedsoftware.com> wrote :

> Hi all,
> 
> I'm pleased to announce my btrfs deduplication utility, written in
> Rust. This operates on whole files, is fast, and I believe
> complements the existing utilities (duperemove, bedup), which exist
> currently.
> 
> Please visit the homepage for more information:
> 
> http://btrfs-dedupe.com
> 

Thanks for sharing your work.
Please be aware of these other similar tools:
- jdupes: https://github.com/jbruchon/jdupes
- rmlint: https://github.com/sahib/rmlint
And of course fdupes.

Some interesting points I have seen in them:
- use xxhash to identify potential duplicates (huge speedup)
- ability to deduplicate read-only snapshots
- identify potential reflinked files (see also my email here:
  https://www.spinics.net/lists/linux-btrfs/msg60081.html)
- ability to filter out hardlinks
- triangle problem: see jdupes readme
- jdupes has started the process to be included in Debian

I hope that helps and that you can share some code with them!
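
The first point (a cheap hash pass to narrow down duplicate candidates, where the speedup comes from hashing only files whose sizes already collide) can be sketched as follows. The posts name xxhash, which is a third-party module, so this sketch substitutes the standard library's hashlib.blake2b; the size-then-hash structure is the part that matters.

```python
import hashlib
import os
from collections import defaultdict

def duplicate_candidates(paths, chunk=1 << 20):
    """Group files that are likely duplicates: first by size (free),
    then by a fast content hash.  A file alone in a size bucket is
    unique and never gets hashed at all."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # unique size, cannot have a duplicate
        for path in group:
            h = hashlib.blake2b()
            with open(path, "rb") as f:
                while block := f.read(chunk):
                    h.update(block)
            by_hash[(size, h.digest())].append(path)

    # Only buckets with two or more members are candidate groups.
    return [g for g in by_hash.values() if len(g) > 1]
```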


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 22:36 ` Saint Germain
@ 2016-11-09 11:24   ` Niccolò Belli
  2016-11-09 12:47     ` Saint Germain
  2016-11-13 12:45   ` James Pharaoh
  1 sibling, 1 reply; 42+ messages in thread
From: Niccolò Belli @ 2016-11-09 11:24 UTC (permalink / raw)
  To: Saint Germain; +Cc: linux-btrfs, James Pharaoh

Hi,
What do you think about jdupes? I'm looking for an alternative to 
duperemove, and rmlint doesn't seem to support btrfs deduplication, so I 
would like to try jdupes. My main problem with duperemove is a memory 
leak; it also seems to lead to greater disk usage: 
https://github.com/markfasheh/duperemove/issues/163

Niccolo' Belli

On martedì 8 novembre 2016 23:36:25 CET, Saint Germain wrote:
> Please be aware of these other similar tools:
> - jdupes: https://github.com/jbruchon/jdupes
> - rmlint: https://github.com/sahib/rmlint
> And of course fdupes.
>
> Some interesting points I have seen in them:
> - use xxhash to identify potential duplicates (huge speedup)
> - ability to deduplicate read-only snapshots
> - identify potential reflinked files (see also my email here:
>   https://www.spinics.net/lists/linux-btrfs/msg60081.html)
> - ability to filter out hardlinks
> - triangle problem: see jdupes readme
> - jdupes has started the process to be included in Debian
>
> I hope that helps and that you can share some code with them!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-09 11:24   ` Niccolò Belli
@ 2016-11-09 12:47     ` Saint Germain
  0 siblings, 0 replies; 42+ messages in thread
From: Saint Germain @ 2016-11-09 12:47 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Niccolò Belli

On Wed, 09 Nov 2016 12:24:51 +0100, Niccolò Belli
<darkbasic@linuxsystems.it> wrote :
> 
> On martedì 8 novembre 2016 23:36:25 CET, Saint Germain wrote:
> > Please be aware of these other similar tools:
> > - jdupes: https://github.com/jbruchon/jdupes
> > - rmlint: https://github.com/sahib/rmlint
> > And of course fdupes.
> >
> > Some interesting points I have seen in them:
> > - use xxhash to identify potential duplicates (huge speedup)
> > - ability to deduplicate read-only snapshots
> > - identify potential reflinked files (see also my email here:
> >   https://www.spinics.net/lists/linux-btrfs/msg60081.html)
> > - ability to filter out hardlinks
> > - triangle problem: see jdupes readme
> > - jdupes has started the process to be included in Debian
> >
> > I hope that helps and that you can share some code with them!
> > 
> Hi,
> What do you think about jdupes? I'm looking for an alternative to
> duperemove, and rmlint doesn't seem to support btrfs deduplication, so
> I would like to try jdupes. My main problem with duperemove is a
> memory leak; it also seems to lead to greater disk usage: 
> https://github.com/markfasheh/duperemove/issues/163

rmlint does support btrfs deduplication:
rmlint --algorithm=xxhash --types="duplicates" --hidden --config=sh:handler=clone --no-hardlinked

I've used jdupes and rmlint to deduplicate 2 TB with 4 GB of RAM and it
took a few hours, so it is acceptable from a performance point of view.
The problems I found have been fixed by both.

The jdupes author is really kind and responsive!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-07 20:54     ` Adam Borowski
  2016-11-08  2:17       ` Darrick J. Wong
@ 2016-11-09 15:02       ` David Sterba
  1 sibling, 0 replies; 42+ messages in thread
From: David Sterba @ 2016-11-09 15:02 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Mark Fasheh, dsterba, James Pharaoh, linux-btrfs

On Mon, Nov 07, 2016 at 09:54:09PM +0100, Adam Borowski wrote:
> [1]. For some reasons zfs-on-linux guys didn't implement this yet, despite
> it being an obvious thing on ZFS.

In my understanding, the COW mechanics are different, there are no
extent back references, so this would require some design updates. See
issue 405 at ZoL tracker.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 22:36 ` Saint Germain
  2016-11-09 11:24   ` Niccolò Belli
@ 2016-11-13 12:45   ` James Pharaoh
  1 sibling, 0 replies; 42+ messages in thread
From: James Pharaoh @ 2016-11-13 12:45 UTC (permalink / raw)
  To: Saint Germain, linux-btrfs

I've updated the BTRFS wiki here with all the new tools people have 
mentioned:

https://btrfs.wiki.kernel.org/index.php/Deduplication#Other_tools

Please let me know if anyone who does not have access to the wiki has 
any additions, updates or corrections to what I've written here.

James

On 08/11/16 23:36, Saint Germain wrote:
> On Sun, 6 Nov 2016 14:30:52 +0100, James Pharaoh
> <james@wellbehavedsoftware.com> wrote :
>
>> Hi all,
>>
>> I'm pleased to announce my btrfs deduplication utility, written in
>> Rust. This operates on whole files, is fast, and I believe
>> complements the existing utilities (duperemove, bedup), which exist
>> currently.
>>
>> Please visit the homepage for more information:
>>
>> http://btrfs-dedupe.com
>>
>
> Thanks for sharing your work.
> Please be aware of these other similar tools:
> - jdupes: https://github.com/jbruchon/jdupes
> - rmlint: https://github.com/sahib/rmlint
> And of course fdupes.
>
> Some interesting points I have seen in them:
> - use xxhash to identify potential duplicates (huge speedup)
> - ability to deduplicate read-only snapshots
> - identify potential reflinked files (see also my email here:
>   https://www.spinics.net/lists/linux-btrfs/msg60081.html)
> - ability to filter out hardlinks
> - triangle problem: see jdupes readme
> - jdupes has started the process to be included in Debian
>
> I hope that helps and that you can share some code with them!
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-07 18:49   ` James Pharaoh
  2016-11-07 18:53     ` James Pharaoh
@ 2016-11-14 18:07     ` Zygo Blaxell
  2016-11-14 18:22       ` James Pharaoh
  1 sibling, 1 reply; 42+ messages in thread
From: Zygo Blaxell @ 2016-11-14 18:07 UTC (permalink / raw)
  To: James Pharaoh; +Cc: Mark Fasheh, linux-btrfs

On Mon, Nov 07, 2016 at 07:49:51PM +0100, James Pharaoh wrote:
> Annoyingly I can't find this now, but I definitely remember reading someone,
> apparently someone knowledgeable, claim that the latest version of the kernel
> which I was using at the time, still suffered from issues regarding the
> dedupe code.
> 
> This was a while ago, and I would be very pleased to hear that there is high
> confidence in the current implementation! I'll post a link if I manage to
> find the comments.

I've been running the btrfs dedup ioctl 7 times per second on average
over 42TB of test data for most of a year (and at a lower rate for two
years).  I have not found any data corruptions due to _dedup_.  I did find
three distinct data corruption kernel bugs unrelated to dedup, and two
test machines with bad RAM, so I'm pretty sure my corruption detection
is working.

That said, I wouldn't run dedup on a kernel older than 4.4.  LTS kernels
might be OK too, but only if they're up to date with backported btrfs
fixes.

Kernels older than 3.13 lack the FILE_EXTENT_SAME ioctl and can
only deduplicate static data (i.e. data you are certain is not being
concurrently modified).  Before 3.12 there are so many bugs you might
as well not bother.

Older kernels are bad for dedup because of non-corruption reasons.
Between 3.13 and 4.4, the following bugs were fixed:

	- false-negative capability checks (e.g. same-inode, EOF extent)
	reduce dedup efficiency

	- ctime updates (older versions would update ctime when a file was
	deduped) mess with incremental backup tools, build systems, etc.

	- kernel memory leaks (self-explanatory)

	- multiple kernel hang/panic bugs (e.g. a deadlock if two threads
	try to read the same extent at the same time, and at least one
	of those threads is dedup; and there was some race condition
	leading to invalid memory access on dedup's comparison reads)
	which won't eat your data, but they might ruin your day anyway.

There is also a still-unresolved problem where the filesystem CPU usage
rises exponentially for some operations depending on the number of shared
references to an extent.  Files which contain blocks with more than a few
thousand shared references can trigger this problem.  A file over 1TB can
keep the kernel busy at 100% CPU for over 40 minutes at a time.

There might also be a correlation between delalloc data and hangs in
extent-same, but I have NOT been able to confirm this.  All I know
at this point is that doing a fsync() on the source FD just before
doing the extent-same ioctl dramatically reduces filesystem hang rates:
several weeks between hangs (or no hangs at all) with fsync, vs. 18 hours
or less without.
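
For reference, that workaround (an fsync() on the source FD immediately before the extent-same call) can be sketched with the raw ioctl. The request number and struct layout below are taken from FIDEDUPERANGE in linux/fs.h (kernel 4.5+; the older btrfs-specific BTRFS_IOC_FILE_EXTENT_SAME uses the same encoding); on a filesystem without dedupe support the call simply fails with an errno, which the sketch reports rather than hiding.

```python
import errno
import fcntl
import os
import struct

# _IOWR(0x94, 54, struct file_dedupe_range); same value as the older
# btrfs-specific BTRFS_IOC_FILE_EXTENT_SAME.
FIDEDUPERANGE = 0xC0189436
FILE_DEDUPE_RANGE_SAME = 0

def dedupe_range(src_fd, src_off, length, dst_fd, dst_off):
    """fsync the source, then ask the kernel to share the source extent
    with the destination iff the two ranges are byte-identical.
    Returns bytes deduped on success, or a negative errno."""
    os.fsync(src_fd)  # the workaround discussed above

    # struct file_dedupe_range header (u64 src_offset, u64 src_length,
    # u16 dest_count, u16+u32 reserved) followed by one
    # file_dedupe_range_info record (s64 dest_fd, u64 dest_offset,
    # u64 bytes_deduped, s32 status, u32 reserved).
    fmt = "=QQHHIqQQiI"
    buf = bytearray(struct.pack(fmt, src_off, length, 1, 0, 0,
                                dst_fd, dst_off, 0, 0, 0))
    try:
        fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)  # kernel mutates buf
    except OSError as exc:
        return -exc.errno  # e.g. EOPNOTSUPP without dedupe support
    *_, bytes_deduped, status, _pad = struct.unpack(fmt, buf)
    if status == FILE_DEDUPE_RANGE_SAME:
        return bytes_deduped
    return -errno.EINVAL  # ranges differed, nothing was linked
```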

> James
> 
> On 07/11/16 18:59, Mark Fasheh wrote:
> >Hi James,
> >
> >Re the following text on your project page:
> >
> >"IMPORTANT CAVEAT — I have read that there are race and/or error
> >conditions which can cause filesystem corruption in the kernel
> >implementation of the deduplication ioctl."
> >
> >Can you expound on that? I'm not aware of any bugs right now but if
> >there is any it'd absolutely be worth having that info on the btrfs
> >list.
> >
> >Thanks,
> >    --Mark
> >
> >
> >On Sun, Nov 6, 2016 at 7:30 AM, James Pharaoh
> ><james@wellbehavedsoftware.com> wrote:
> >>Hi all,
> >>
> >>I'm pleased to announce my btrfs deduplication utility, written in Rust.
> >>This operates on whole files, is fast, and I believe complements the
> >>existing utilities (duperemove, bedup), which exist currently.
> >>
> >>Please visit the homepage for more information:
> >>
> >>http://btrfs-dedupe.com
> >>
> >>James Pharaoh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 18:07     ` Zygo Blaxell
@ 2016-11-14 18:22       ` James Pharaoh
  2016-11-14 18:39         ` Austin S. Hemmelgarn
  2016-11-14 18:43         ` Zygo Blaxell
  0 siblings, 2 replies; 42+ messages in thread
From: James Pharaoh @ 2016-11-14 18:22 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Mark Fasheh, linux-btrfs

On 14/11/16 19:07, Zygo Blaxell wrote:
> On Mon, Nov 07, 2016 at 07:49:51PM +0100, James Pharaoh wrote:
>> Annoyingly I can't find this now, but I definitely remember reading someone,
> apparently someone knowledgeable, claim that the latest version of the kernel
>> which I was using at the time, still suffered from issues regarding the
>> dedupe code.
>
>> This was a while ago, and I would be very pleased to hear that there is high
>> confidence in the current implementation! I'll post a link if I manage to
>> find the comments.
>
> I've been running the btrfs dedup ioctl 7 times per second on average
> over 42TB of test data for most of a year (and at a lower rate for two
> years).  I have not found any data corruptions due to _dedup_.  I did find
> three distinct data corruption kernel bugs unrelated to dedup, and two
> test machines with bad RAM, so I'm pretty sure my corruption detection
> is working.
>
> That said, I wouldn't run dedup on a kernel older than 4.4.  LTS kernels
> might be OK too, but only if they're up to date with backported btrfs
> fixes.

Ok, I think this might have referred to the 4.2 kernel, which was newly 
released at the time. I wish I could find the post!

> Kernels older than 3.13 lack the FILE_EXTENT_SAME ioctl and can
> only deduplicate static data (i.e. data you are certain is not being
> concurrently modified).  Before 3.12 there are so many bugs you might
> as well not bother.

Yes well I don't need to be told that, sadly.

> Older kernels are bad for dedup because of non-corruption reasons.
> Between 3.13 and 4.4, the following bugs were fixed:
>
> 	- false-negative capability checks (e.g. same-inode, EOF extent)
> 	reduce dedup efficiency
>
> 	- ctime updates (older versions would update ctime when a file was
> 	deduped) mess with incremental backup tools, build systems, etc.
>
> 	- kernel memory leaks (self-explanatory)
>
> 	- multiple kernel hang/panic bugs (e.g. a deadlock if two threads
> 	try to read the same extent at the same time, and at least one
> 	of those threads is dedup; and there was some race condition
> 	leading to invalid memory access on dedup's comparison reads)
> 	which won't eat your data, but they might ruin your day anyway.

Ok, I think I've seen some stuff like this; I certainly have problems, 
but never a loss of data. Things can take a LONG time to get out of the 
filesystem, though.

> There is also a still-unresolved problem where the filesystem CPU usage
> rises exponentially for some operations depending on the number of shared
> references to an extent.  Files which contain blocks with more than a few
> thousand shared references can trigger this problem.  A file over 1TB can
> keep the kernel busy at 100% CPU for over 40 minutes at a time.

Yes, I see this all the time. For my use cases, I don't really care 
about "shared references" as blocks of files, but am happy to simply 
deduplicate at the whole-file level. I wonder if this still will have 
the same effect, however. I guess that this could be mitigated in a 
tool, but this is going to be both annoying and not the most elegant 
solution.

> There might also be a correlation between delalloc data and hangs in
> extent-same, but I have NOT been able to confirm this.  All I know
> at this point is that doing a fsync() on the source FD just before
> doing the extent-same ioctl dramatically reduces filesystem hang rates:
> several weeks between hangs (or no hangs at all) with fsync, vs. 18 hours
> or less without.

Interesting, I'll maybe see if I can make use of this.

One thing I am keen to understand is if BTRFS will automatically ignore 
a request to deduplicate a file if it is already deduplicated? Given the 
performance I see when doing a repeat deduplication, it seems to me that 
it can't be doing so, although this could be caused by the CPU usage you 
mention above.

In any case, I'm considering some digging into the filesystem structures 
to see if I can work this out myself before I do any deduplication. I'm 
fairly sure this should be relatively simple to work out, at least well 
enough for my purposes.

James

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-08 11:06 ` Niccolò Belli
  2016-11-08 11:38   ` James Pharaoh
@ 2016-11-14 18:27   ` Zygo Blaxell
  1 sibling, 0 replies; 42+ messages in thread
From: Zygo Blaxell @ 2016-11-14 18:27 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: James Pharaoh, linux-btrfs

On Tue, Nov 08, 2016 at 12:06:01PM +0100, Niccolò Belli wrote:
> Nice, you should probably update the btrfs wiki as well, because there is no
> mention of btrfs-dedupe.
> 
> First question, why this name? Don't you plan to support xfs as well?

Does XFS plan to support LOGICAL_INO, INO_PATHS, and something analogous
to SEARCH_V2?

POSIX API + FILE_EXTENT_SAME is OK as the lowest common denominator
across arbitrary filesystems, but a btrfs-specific tool can do a lot
better, especially for incremental dedup and low-RAM algorithms.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 18:22       ` James Pharaoh
@ 2016-11-14 18:39         ` Austin S. Hemmelgarn
  2016-11-14 19:51           ` Zygo Blaxell
  2016-11-14 18:43         ` Zygo Blaxell
  1 sibling, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2016-11-14 18:39 UTC (permalink / raw)
  To: James Pharaoh, Zygo Blaxell; +Cc: Mark Fasheh, linux-btrfs

On 2016-11-14 13:22, James Pharaoh wrote:
> On 14/11/16 19:07, Zygo Blaxell wrote:
>> On Mon, Nov 07, 2016 at 07:49:51PM +0100, James Pharaoh wrote:
>>> Annoyingly I can't find this now, but I definitely remember reading
>>> someone,
>>> apparently someone knowledgeable, claim that the latest version of the
>>> kernel
>>> which I was using at the time, still suffered from issues regarding the
>>> dedupe code.
>>
>>> This was a while ago, and I would be very pleased to hear that there
>>> is high
>>> confidence in the current implementation! I'll post a link if I
>>> manage to
>>> find the comments.
>>
>> I've been running the btrfs dedup ioctl 7 times per second on average
>> over 42TB of test data for most of a year (and at a lower rate for two
>> years).  I have not found any data corruptions due to _dedup_.  I did
>> find
>> three distinct data corruption kernel bugs unrelated to dedup, and two
>> test machines with bad RAM, so I'm pretty sure my corruption detection
>> is working.
>>
>> That said, I wouldn't run dedup on a kernel older than 4.4.  LTS kernels
>> might be OK too, but only if they're up to date with backported btrfs
>> fixes.
>
> Ok, I think this might have referred to the 4.2 kernel, which was newly
> released at the time. I wish I could find the post!
>
>> Kernels older than 3.13 lack the FILE_EXTENT_SAME ioctl and can
>> only deduplicate static data (i.e. data you are certain is not being
>> concurrently modified).  Before 3.12 there are so many bugs you might
>> as well not bother.
>
> Yes well I don't need to be told that, sadly.
>
>> Older kernels are bad for dedup because of non-corruption reasons.
>> Between 3.13 and 4.4, the following bugs were fixed:
>>
>>     - false-negative capability checks (e.g. same-inode, EOF extent)
>>     reduce dedup efficiency
>>
>>     - ctime updates (older versions would update ctime when a file was
>>     deduped) mess with incremental backup tools, build systems, etc.
>>
>>     - kernel memory leaks (self-explanatory)
>>
>>     - multiple kernel hang/panic bugs (e.g. a deadlock if two threads
>>     try to read the same extent at the same time, and at least one
>>     of those threads is dedup; and there was some race condition
>>     leading to invalid memory access on dedup's comparison reads)
>>     which won't eat your data, but they might ruin your day anyway.
>
> Ok, I think I've seen some stuff like this; I certainly have
> problems, but never a loss of data. Things can take a LONG time to get
> out of the filesystem, though.
>
>> There is also a still-unresolved problem where the filesystem CPU usage
>> rises exponentially for some operations depending on the number of shared
>> references to an extent.  Files which contain blocks with more than a few
>> thousand shared references can trigger this problem.  A file over 1TB can
>> keep the kernel busy at 100% CPU for over 40 minutes at a time.
>
> Yes, I see this all the time. For my use cases, I don't really care
> about "shared references" as blocks of files, but am happy to simply
> deduplicate at the whole-file level. I wonder if this still will have
> the same effect, however. I guess that this could be mitigated in a
> tool, but this is going to be both annoying and not the most elegant
> solution.
The issue is at the extent level, so it will impact whole files too (but 
it will have less impact on defragmented files that are then 
deduplicated as whole files).  Pretty much anything that pins references 
to extents will impact this, so cloned extents and snapshots will also 
have an impact.
>
>> There might also be a correlation between delalloc data and hangs in
>> extent-same, but I have NOT been able to confirm this.  All I know
>> at this point is that doing a fsync() on the source FD just before
>> doing the extent-same ioctl dramatically reduces filesystem hang rates:
>> several weeks between hangs (or no hangs at all) with fsync, vs. 18 hours
>> or less without.
>
> Interesting, I'll maybe see if I can make use of this.
>
> One thing I am keen to understand is if BTRFS will automatically ignore
> a request to deduplicate a file if it is already deduplicated? Given the
> performance I see when doing a repeat deduplication, it seems to me that
> it can't be doing so, although this could be caused by the CPU usage you
> mention above.
What's happening is that the dedupe ioctl does a byte-wise comparison of 
the ranges to make sure they're the same before linking them.  This is 
actually what takes most of the time when calling the ioctl, and is part 
of why it takes longer the larger the range to deduplicate is.  In 
essence, it's behaving like an OS should and not trusting userspace to 
make reasonable requests (which is also why there's a separate ioctl to 
clone a range from another file instead of deduplicating existing data).

TBH, even though it's kind of annoying from a performance perspective, 
it's a rather nice safety net to have.  For example, one of the cases 
where I do deduplication is a couple of directories where each directory 
is an overlapping partial subset of one large tree which I keep 
elsewhere.  In this case, I can tell just by filename exactly what files 
might be duplicates, so the ioctl's check lets me just call the ioctl on 
all potential duplicates (after checking size, no point in wasting time 
if the files obviously aren't duplicates), and have it figure out 
whether or not they can be deduplicated.
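
The approach Austin describes (filter candidates cheaply in userspace, then let the ioctl's byte-wise compare be the final authority) can be sketched as below. The pairing logic is the point here; the actual dedupe operation is passed in as a callable, which in practice would issue the extent-same ioctl, and all names are illustrative.

```python
import os
from collections import defaultdict

def dedupe_same_size(paths, dedupe_call):
    """Group files by size and hand every (source, candidate) pair to
    dedupe_call.  There is deliberately no content compare here: the
    kernel's dedupe ioctl verifies the ranges byte-for-byte and refuses
    the link otherwise, so a false candidate costs time, not data."""
    by_size = defaultdict(list)
    for path in paths:
        size = os.path.getsize(path)
        if size:                 # skip empty files, nothing to share
            by_size[size].append(path)

    attempted = 0
    for size, group in sorted(by_size.items()):
        source = group[0]        # first file of each size acts as source
        for candidate in group[1:]:
            dedupe_call(source, candidate, size)  # extent-same in practice
            attempted += 1
    return attempted
```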
>
> In any case, I'm considering some digging into the filesystem structures
> to see if I can work this out myself before I do any deduplication. I'm
> fairly sure this should be relatively simple to work out, at least well
> enough for my purposes.
Sadly, there's no way to avoid doing so right now.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 18:22       ` James Pharaoh
  2016-11-14 18:39         ` Austin S. Hemmelgarn
@ 2016-11-14 18:43         ` Zygo Blaxell
  1 sibling, 0 replies; 42+ messages in thread
From: Zygo Blaxell @ 2016-11-14 18:43 UTC (permalink / raw)
  To: James Pharaoh; +Cc: Mark Fasheh, linux-btrfs


On Mon, Nov 14, 2016 at 07:22:59PM +0100, James Pharaoh wrote:
> On 14/11/16 19:07, Zygo Blaxell wrote:
> >There is also a still-unresolved problem where the filesystem CPU usage
> >rises exponentially for some operations depending on the number of shared
> >references to an extent.  Files which contain blocks with more than a few
> >thousand shared references can trigger this problem.  A file over 1TB can
> >keep the kernel busy at 100% CPU for over 40 minutes at a time.
> 
> Yes, I see this all the time. For my use cases, I don't really care about
> "shared references" as blocks of files, but am happy to simply deduplicate
> at the whole-file level. I wonder if this still will have the same effect,
> however. I guess that this could be mitigated in a tool, but this is going
> to be both annoying and not the most elegant solution.

If you have huge files (1TB+) this can be a problem even with whole-file
deduplications (which are really just extent-level deduplications applied
to the entire file).  The CPU time is a product of file size and extent
reference count with some other multipliers on top.

I've hacked around it by timing how long it takes to manipulate the data,
and blacklisting any hash value or block address that takes more than
10 seconds to process (if such a block is found after blacklisting, just
skip processing the block/extent/file entirely).  It turns out there are
very few of these in practice (only a few hundred per TB) but these few
hundred block hash values occur millions of times in a large data corpus.
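A sketch of that blacklisting logic in Python (the function names, the injectable clock, and the callback are illustrative inventions, not code from any existing tool; the 10-second threshold matches the description above):

```python
import time

BLACKLIST_THRESHOLD = 10.0  # seconds, matching the cutoff described above

def make_deduper(threshold=BLACKLIST_THRESHOLD, clock=time.monotonic):
    """Wrap a dedup operation so that any hash value whose processing
    exceeds `threshold` is blacklisted and skipped from then on."""
    blacklist = set()
    def run(block_hash, dedupe_fn):
        if block_hash in blacklist:
            return False            # known-expensive hash: skip entirely
        start = clock()
        dedupe_fn()                 # the actual extent-same call would go here
        if clock() - start > threshold:
            blacklist.add(block_hash)
        return True
    return run
```

Since only a few hundred hashes per TB ever trip the threshold, the blacklist stays tiny while avoiding the pathological multi-minute operations.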

> One thing I am keen to understand is if BTRFS will automatically ignore a
> request to deduplicate a file if it is already deduplicated? Given the
> performance I see when doing a repeat deduplication, it seems to me that it
> can't be doing so, although this could be caused by the CPU usage you
> mention above.

As far as I can tell btrfs doesn't do anything different in this
case--it'll happily repeat the entire lock/read/compare/delete/insert
sequence even if the outcome cannot be different from the initial
conditions.  Due to limitations of VFS caching it'll read the same blocks
from storage hardware twice, too.

> In any case, I'm considering some digging into the filesystem structures to
see if I can work this out myself before I do any deduplication. I'm fairly
> sure this should be relatively simple to work out, at least well enough for
> my purposes.

I used FIEMAP (then later replaced it with SEARCH_V2 for speed) to map
the extents to physical addresses before deduping them.  If you're only
going to do whole-file dedup then you only need to care about the physical
address of the first non-hole extent.
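That physical-address check amounts to a few lines of struct parsing. A Python sketch, assuming `extent_buf` holds the raw extent array returned by the FIEMAP ioctl (the helper names are ours; the field layout is `struct fiemap_extent` from `<linux/fiemap.h>`):

```python
import struct

# struct fiemap_extent: fe_logical, fe_physical, fe_length (all __u64),
# two reserved __u64, fe_flags (__u32), three reserved __u32 -- 56 bytes.
FIEMAP_EXTENT_FMT = "=QQQ2QI3I"
FIEMAP_EXTENT_SIZE = struct.calcsize(FIEMAP_EXTENT_FMT)  # 56

def first_physical(extent_buf):
    """Return fe_physical of the first extent in a raw fiemap buffer,
    or None if the file has no extents (empty or hole-only)."""
    if len(extent_buf) < FIEMAP_EXTENT_SIZE:
        return None
    fields = struct.unpack_from(FIEMAP_EXTENT_FMT, extent_buf)
    return fields[1]  # fe_physical

def already_shared(buf_a, buf_b):
    """Whole-file heuristic described above: equal first physical
    addresses mean already deduplicated; a zero or missing address
    means the file cannot be deduplicated this way."""
    pa, pb = first_physical(buf_a), first_physical(buf_b)
    if pa is None or pb is None or pa == 0 or pb == 0:
        return None
    return pa == pb
```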


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 18:39         ` Austin S. Hemmelgarn
@ 2016-11-14 19:51           ` Zygo Blaxell
  2016-11-14 19:56             ` Austin S. Hemmelgarn
  2016-11-14 20:07             ` James Pharaoh
  0 siblings, 2 replies; 42+ messages in thread
From: Zygo Blaxell @ 2016-11-14 19:51 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: James Pharaoh, Mark Fasheh, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3014 bytes --]

On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 13:22, James Pharaoh wrote:
> >One thing I am keen to understand is if BTRFS will automatically ignore
> >a request to deduplicate a file if it is already deduplicated? Given the
> >performance I see when doing a repeat deduplication, it seems to me that
> >it can't be doing so, although this could be caused by the CPU usage you
> >mention above.
> What's happening is that the dedupe ioctl does a byte-wise comparison of the
> ranges to make sure they're the same before linking them.  This is actually
> what takes most of the time when calling the ioctl, and is part of why it
> takes longer the larger the range to deduplicate is.  In essence, it's
> behaving like an OS should and not trusting userspace to make reasonable
> requests (which is also why there's a separate ioctl to clone a range from
> another file instead of deduplicating existing data).

Deduplicating an extent that may be concurrently modified during the
dedup is a reasonable userspace request.  In the general case there's
no way for userspace to ensure that it's not happening.

That said, some optimization is possible (although there are good reasons
not to bother with optimization in the kernel):

	- VFS could recognize when it has two separate references to
	the same physical extent and not re-read the same data twice
	(but that requires teaching VFS how to do CoW in general, and is
	hard for political reasons on top of the obvious technical ones).

	- the extent-same ioctl could check to see which extents
	are referenced by the src and dst ranges, and return success
	immediately without reading data if they are the same (but
	userspace should already know this, or it's wasting a huge amount
	of time before it even calls the kernel).

> TBH, even though it's kind of annoying from a performance perspective, it's
> a rather nice safety net to have.  For example, one of the cases where I do
> deduplication is a couple of directories where each directory is an
> overlapping partial subset of one large tree which I keep elsewhere.  In
> this case, I can tell just by filename exactly what files might be
> duplicates, so the ioctl's check lets me just call the ioctl on all
> potential duplicates (after checking size, no point in wasting time if the
> files obviously aren't duplicates), and have it figure out whether or not
> they can be deduplicated.
> >
> >In any case, I'm considering some digging into the filesystem structures
> >to see if I can work this out myself before I do any deduplication. I'm
> >fairly sure this should be relatively simple to work out, at least well
> >enough for my purposes.
> Sadly, there's no way to avoid doing so right now.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 19:51           ` Zygo Blaxell
@ 2016-11-14 19:56             ` Austin S. Hemmelgarn
  2016-11-14 21:10               ` Zygo Blaxell
  2016-11-14 20:07             ` James Pharaoh
  1 sibling, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2016-11-14 19:56 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: James Pharaoh, Mark Fasheh, linux-btrfs

On 2016-11-14 14:51, Zygo Blaxell wrote:
> On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:
>> On 2016-11-14 13:22, James Pharaoh wrote:
>>> One thing I am keen to understand is if BTRFS will automatically ignore
>>> a request to deduplicate a file if it is already deduplicated? Given the
>>> performance I see when doing a repeat deduplication, it seems to me that
>>> it can't be doing so, although this could be caused by the CPU usage you
>>> mention above.
>> What's happening is that the dedupe ioctl does a byte-wise comparison of the
>> ranges to make sure they're the same before linking them.  This is actually
>> what takes most of the time when calling the ioctl, and is part of why it
>> takes longer the larger the range to deduplicate is.  In essence, it's
>> behaving like an OS should and not trusting userspace to make reasonable
>> requests (which is also why there's a separate ioctl to clone a range from
>> another file instead of deduplicating existing data).
>
> Deduplicating an extent that may be concurrently modified during the
> dedup is a reasonable userspace request.  In the general case there's
> no way for userspace to ensure that it's not happening.
I'm not even talking about the locking, I'm talking about the data 
comparison that the ioctl does to ensure they are the same before 
deduplicating them, and specifically that protecting against userspace 
just passing in two random extents that happen to be the same size but 
not contain the same data (because deduplication _should_ reject such a 
situation, that's what the clone ioctl is for).

The locking is perfectly reasonable and shouldn't contribute that much 
to the overhead (unless you're being crazy and deduplicating thousands 
of tiny blocks of data).
>
> That said, some optimization is possible (although there are good reasons
> not to bother with optimization in the kernel):
>
> 	- VFS could recognize when it has two separate references to
> 	the same physical extent and not re-read the same data twice
> 	(but that requires teaching VFS how to do CoW in general, and is
> 	hard for political reasons on top of the obvious technical ones).
>
> 	- the extent-same ioctl could check to see which extents
> 	are referenced by the src and dst ranges, and return success
> 	immediately without reading data if they are the same (but
> 	userspace should already know this, or it's wasting a huge amount
> 	of time before it even calls the kernel).
>
>> TBH, even though it's kind of annoying from a performance perspective, it's
>> a rather nice safety net to have.  For example, one of the cases where I do
>> deduplication is a couple of directories where each directory is an
>> overlapping partial subset of one large tree which I keep elsewhere.  In
>> this case, I can tell just by filename exactly what files might be
>> duplicates, so the ioctl's check lets me just call the ioctl on all
>> potential duplicates (after checking size, no point in wasting time if the
>> files obviously aren't duplicates), and have it figure out whether or not
>> they can be deduplicated.
>>>
>>> In any case, I'm considering some digging into the filesystem structures
>>> to see if I can work this out myself before I do any deduplication. I'm
>>> fairly sure this should be relatively simple to work out, at least well
>>> enough for my purposes.
>> Sadly, there's no way to avoid doing so right now.
>>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 19:51           ` Zygo Blaxell
  2016-11-14 19:56             ` Austin S. Hemmelgarn
@ 2016-11-14 20:07             ` James Pharaoh
  2016-11-14 21:22               ` Zygo Blaxell
  1 sibling, 1 reply; 42+ messages in thread
From: James Pharaoh @ 2016-11-14 20:07 UTC (permalink / raw)
  To: Zygo Blaxell, Austin S. Hemmelgarn; +Cc: Mark Fasheh, linux-btrfs

On 14/11/16 20:51, Zygo Blaxell wrote:
> On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:
>> On 2016-11-14 13:22, James Pharaoh wrote:
>>> One thing I am keen to understand is if BTRFS will automatically ignore
>>> a request to deduplicate a file if it is already deduplicated? Given the
>>> performance I see when doing a repeat deduplication, it seems to me that
>>> it can't be doing so, although this could be caused by the CPU usage you
>>> mention above.
 >>
>> What's happening is that the dedupe ioctl does a byte-wise comparison of the
>> ranges to make sure they're the same before linking them.  This is actually
>> what takes most of the time when calling the ioctl, and is part of why it
>> takes longer the larger the range to deduplicate is.  In essence, it's
>> behaving like an OS should and not trusting userspace to make reasonable
>> requests (which is also why there's a separate ioctl to clone a range from
>> another file instead of deduplicating existing data).
>
> 	- the extent-same ioctl could check to see which extents
> 	are referenced by the src and dst ranges, and return success
> 	immediately without reading data if they are the same (but
> 	userspace should already know this, or it's wasting a huge amount
> 	of time before it even calls the kernel).

Yes, this is what I am talking about. I believe I should be able to read 
data about the BTRFS data structures and determine if this is the case. 
I don't care if there are false matches, due to concurrent updates, but 
there'll be a /lot/ of repeat deduplications unless I do this, because 
even if the file is identical, the mtime etc hasn't changed, and I have 
a record of previously doing a dedupe, there's no guarantee that the 
file hasn't been rewritten in place (eg by rsync), and no way that I 
know of to reliably detect if a file has been changed.

I am sure there are libraries out there which can look into the data 
structures of a BTRFS file system, I haven't researched this in detail 
though. I imagine that with some kind of lock on a BTRFS root, this 
could be achieved by simply reading the data from the disk, since I 
believe that everything is copy-on-write, so no existing data should be 
overwritten until all roots referring to it are updated. Perhaps I'm 
missing something though...

James

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 19:56             ` Austin S. Hemmelgarn
@ 2016-11-14 21:10               ` Zygo Blaxell
  2016-11-15 12:26                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 42+ messages in thread
From: Zygo Blaxell @ 2016-11-14 21:10 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: James Pharaoh, Mark Fasheh, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4307 bytes --]

On Mon, Nov 14, 2016 at 02:56:51PM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 14:51, Zygo Blaxell wrote:
> >Deduplicating an extent that may be concurrently modified during the
> >dedup is a reasonable userspace request.  In the general case there's
> >no way for userspace to ensure that it's not happening.
> I'm not even talking about the locking, I'm talking about the data
> comparison that the ioctl does to ensure they are the same before
> deduplicating them, and specifically that protecting against userspace just
> passing in two random extents that happen to be the same size but not
> contain the same data (because deduplication _should_ reject such a
> situation, that's what the clone ioctl is for).

If I'm deduping a VM image, and the virtual host is writing to said image
(which is likely since an incremental dedup will be intentionally doing
dedup over recently active data sets), the extent I just compared in
userspace might be different by the time the kernel sees it.

This is an important reason why the whole lock/read/compare/replace step
is an atomic operation from userspace's PoV.

The read also saves having to confirm a short/weak hash isn't a collision.
The RAM savings from using weak hashes (~48 bits) are a huge performance
win.
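To put rough numbers on that savings (illustrative arithmetic with our own assumed figures, not Zygo's measurements):

```python
# Illustrative arithmetic only: one hash-table entry per 4 KiB block
# on a 4 TiB filesystem.
blocks = (4 * 2**40) // (4 * 2**10)   # 2**30 entries (~1.07e9)

weak_bytes = blocks * 6     # ~48-bit hash -> 6 bytes of raw hash per entry
sha_bytes = blocks * 32     # SHA-256      -> 32 bytes per entry

# Raw hash storage alone: ~6 GiB vs ~32 GiB (a real table adds per-entry
# overhead for block addresses, pointers, etc. on top of this).
print(weak_bytes // 2**30, sha_bytes // 2**30)  # -> 6 32
```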

The locking overhead is very small compared to the reading overhead,
and (in the absence of bugs) it will only block concurrent writes to the
same offset range in the src/dst inodes (based on a read of the code...I
don't know if there's also an inode-level or backref-level barrier that
expands the locking scope).

I'm not sure the ioctl is well designed for simply throwing random
data at it, especially not entire files (it can't handle files over
16MB anyway).  It will read more data than it has to compared to a
block-by-block comparison from userspace with prefetches or a pair of
IO threads.  If userspace reads both copies of the data just before
issuing the extent-same call, the kernel will read the data from cache
reasonably quickly.
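The call itself is a thin wrapper around a packed struct. A sketch of building the argument buffer in Python (the field layout is `struct btrfs_ioctl_same_args` / `btrfs_ioctl_same_extent_info` from the btrfs ioctl headers; `pack_extent_same` is our name, and a real caller would pass the buffer to `fcntl.ioctl` and then check each destination's `status` field):

```python
import struct

# btrfs_ioctl_same_args header: logical_offset (u64), length (u64),
# dest_count (u16), reserved1 (u16), reserved2 (u32) -- 24 bytes --
# followed by one 32-byte info record per destination:
# fd (s64), logical_offset (u64), bytes_deduped (u64, out),
# status (s32, out), reserved (u32).
SAME_ARGS_FMT = "=QQHHI"
SAME_INFO_FMT = "=qQQiI"

MAX_DEDUPE_LEN = 16 * 1024 * 1024  # the 16MB per-call limit noted above

def pack_extent_same(src_offset, length, dests):
    """dests is a list of (fd, offset) pairs; returns the raw buffer
    for ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, buf).  We reject
    over-long ranges up front (the kernel would silently clamp them),
    which keeps userspace bookkeeping honest."""
    if length > MAX_DEDUPE_LEN:
        raise ValueError("range exceeds the 16MB extent-same limit")
    buf = struct.pack(SAME_ARGS_FMT, src_offset, length, len(dests), 0, 0)
    for fd, offset in dests:
        buf += struct.pack(SAME_INFO_FMT, fd, offset, 0, 0, 0)
    return buf
```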

> The locking is perfectly reasonable and shouldn't contribute that much to
> the overhead (unless you're being crazy and deduplicating thousands of tiny
> blocks of data).

Why is deduplicating thousands of blocks of data crazy?  I already
deduplicate four orders of magnitude more than that per week.

> >That said, some optimization is possible (although there are good reasons
> >not to bother with optimization in the kernel):
> >
> >	- VFS could recognize when it has two separate references to
> >	the same physical extent and not re-read the same data twice
> >	(but that requires teaching VFS how to do CoW in general, and is
> >	hard for political reasons on top of the obvious technical ones).
> >
> >	- the extent-same ioctl could check to see which extents
> >	are referenced by the src and dst ranges, and return success
> >	immediately without reading data if they are the same (but
> >	userspace should already know this, or it's wasting a huge amount
> >	of time before it even calls the kernel).
> >
> >>TBH, even though it's kind of annoying from a performance perspective, it's
> >>a rather nice safety net to have.  For example, one of the cases where I do
> >>deduplication is a couple of directories where each directory is an
> >>overlapping partial subset of one large tree which I keep elsewhere.  In
> >>this case, I can tell just by filename exactly what files might be
> >>duplicates, so the ioctl's check lets me just call the ioctl on all
> >>potential duplicates (after checking size, no point in wasting time if the
> >>files obviously aren't duplicates), and have it figure out whether or not
> >>they can be deduplicated.
> >>>
> >>>In any case, I'm considering some digging into the filesystem structures
> >>>to see if I can work this out myself before I do any deduplication. I'm
> >>>fairly sure this should be relatively simple to work out, at least well
> >>>enough for my purposes.
> >>Sadly, there's no way to avoid doing so right now.
> >>
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 20:07             ` James Pharaoh
@ 2016-11-14 21:22               ` Zygo Blaxell
  0 siblings, 0 replies; 42+ messages in thread
From: Zygo Blaxell @ 2016-11-14 21:22 UTC (permalink / raw)
  To: James Pharaoh; +Cc: Austin S. Hemmelgarn, Mark Fasheh, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3411 bytes --]

On Mon, Nov 14, 2016 at 09:07:51PM +0100, James Pharaoh wrote:
> On 14/11/16 20:51, Zygo Blaxell wrote:
> >On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:
> >>On 2016-11-14 13:22, James Pharaoh wrote:
> >>>One thing I am keen to understand is if BTRFS will automatically ignore
> >>>a request to deduplicate a file if it is already deduplicated? Given the
> >>>performance I see when doing a repeat deduplication, it seems to me that
> >>>it can't be doing so, although this could be caused by the CPU usage you
> >>>mention above.
> >>
> >>What's happening is that the dedupe ioctl does a byte-wise comparison of the
> >>ranges to make sure they're the same before linking them.  This is actually
> >>what takes most of the time when calling the ioctl, and is part of why it
> >>takes longer the larger the range to deduplicate is.  In essence, it's
> >>behaving like an OS should and not trusting userspace to make reasonable
> >>requests (which is also why there's a separate ioctl to clone a range from
> >>another file instead of deduplicating existing data).
> >
> >	- the extent-same ioctl could check to see which extents
> >	are referenced by the src and dst ranges, and return success
> >	immediately without reading data if they are the same (but
> >	userspace should already know this, or it's wasting a huge amount
> >	of time before it even calls the kernel).
> 
> Yes, this is what I am talking about. I believe I should be able to read
> data about the BTRFS data structures and determine if this is the case. I
> don't care if there are false matches, due to concurrent updates, but
> there'll be a /lot/ of repeat deduplications unless I do this, because even
> if the file is identical, the mtime etc hasn't changed, and I have a record
> of previously doing a dedupe, there's no guarantee that the file hasn't been
> rewritten in place (eg by rsync), and no way that I know of to reliably
> detect if a file has been changed.
> 
> I am sure there are libraries out there which can look into the data
> structures of a BTRFS file system, I haven't researched this in detail
> though. I imagine that with some kind of lock on a BTRFS root, this could be
> achieved by simply reading the data from the disk, since I believe that
> everything is copy-on-write, so no existing data should be overwritten until
> all roots referring to it are updated. Perhaps I'm missing something
> though...

FIEMAP (VFS) and SEARCH_V2 (btrfs-specific) will both give you access
to the underlying physical block numbers.  SEARCH_V2 is non-trivial
to use without reverse-engineering significant parts of btrfs-progs.
SEARCH_V2 is a generic tree-searching tool which will give you all kinds
of information about btrfs structures...it's essential for a sophisticated
deduplicator and overkill for a simple one.

For full-file dedup using FIEMAP you only need to look at the "physical"
field of the first extent (if it's zero or the same as the other file, the
files cannot be deduplicated or are already deduplicated, respectively).
The source for 'filefrag' (from e2fsprogs) is good for learning how
FIEMAP works.

For block-level dedup you need to look at each extent individually.
That's much slower and full of additional caveats.  If you're going down
that road it's probably better to just improve duperemove instead.

> James

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-14 21:10               ` Zygo Blaxell
@ 2016-11-15 12:26                 ` Austin S. Hemmelgarn
  2016-11-15 17:52                   ` Zygo Blaxell
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2016-11-15 12:26 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: James Pharaoh, Mark Fasheh, linux-btrfs

On 2016-11-14 16:10, Zygo Blaxell wrote:
> On Mon, Nov 14, 2016 at 02:56:51PM -0500, Austin S. Hemmelgarn wrote:
>> On 2016-11-14 14:51, Zygo Blaxell wrote:
>>> Deduplicating an extent that may be concurrently modified during the
>>> dedup is a reasonable userspace request.  In the general case there's
>>> no way for userspace to ensure that it's not happening.
>> I'm not even talking about the locking, I'm talking about the data
>> comparison that the ioctl does to ensure they are the same before
>> deduplicating them, and specifically that protecting against userspace just
>> passing in two random extents that happen to be the same size but not
>> contain the same data (because deduplication _should_ reject such a
>> situation, that's what the clone ioctl is for).
>
> If I'm deduping a VM image, and the virtual host is writing to said image
> (which is likely since an incremental dedup will be intentionally doing
> dedup over recently active data sets), the extent I just compared in
> userspace might be different by the time the kernel sees it.
>
> This is an important reason why the whole lock/read/compare/replace step
> is an atomic operation from userspace's PoV.
>
> The read also saves having to confirm a short/weak hash isn't a collision.
> The RAM savings from using weak hashes (~48 bits) are a huge performance
> win.
>
> The locking overhead is very small compared to the reading overhead,
> and (in the absence of bugs) it will only block concurrent writes to the
> same offset range in the src/dst inodes (based on a read of the code...I
> don't know if there's also an inode-level or backref-level barrier that
> expands the locking scope).
I'm not arguing that it's a bad thing that the kernel is doing this, I'm 
just saying that the locking overhead is minuscule in most cases 
compared to the data comparison.  It is absolutely necessary for exactly 
the reasons you are outlining.
>
> I'm not sure the ioctl is well designed for simply throwing random
> data at it, especially not entire files (it can't handle files over
> 16MB anyway).  It will read more data than it has to compared to a
> block-by-block comparison from userspace with prefetches or a pair of
> IO threads.  If userspace reads both copies of the data just before
> issuing the extent-same call, the kernel will read the data from cache
> reasonably quickly.
It still depends on the use case to a certain extent.  In the case I was 
using as an example, I know to a reasonably certain degree (barring 
tampering, bugs, or hardware failure) that any two files are identical, 
and I actually don't want to trash the page-cache just to deduplicate 
data faster (the data set in question is large, but most of it is idle at
any given point in time), so there's no point in me prereading 
everything in userspace, which in turn makes the script I use much 
simpler (the most complex part is figuring out how to split extents for 
files bigger than the ioctl can handle such that I don't have tiny tail 
extents but still have a minimum number per file).
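That splitting step can be sketched as follows (the 16MiB limit comes from the ioctl; `MIN_TAIL`, the 4KiB alignment, and the tail-merging policy are our illustrative choices, not the actual script):

```python
CHUNK = 16 * 1024 * 1024   # per-call limit of the extent-same ioctl
MIN_TAIL = 1024 * 1024     # hypothetical smallest range we tolerate

def split_ranges(file_len, chunk=CHUNK, min_tail=MIN_TAIL):
    """Split [0, file_len) into (offset, length) ranges of at most
    `chunk` bytes.  A tail shorter than min_tail is merged with the
    previous chunk and re-split so neither final piece is tiny."""
    if file_len <= chunk:
        return [(0, file_len)]
    ranges = []
    off = 0
    while file_len - off > chunk:
        ranges.append((off, chunk))
        off += chunk
    tail = file_len - off
    if tail < min_tail and ranges:
        prev_off, prev_len = ranges.pop()
        combined = prev_len + tail
        half = (combined // 2) // 4096 * 4096  # keep offsets block-aligned
        ranges.append((prev_off, half))
        ranges.append((prev_off + half, combined - half))
    else:
        ranges.append((off, tail))
    return ranges
```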
>
>> The locking is perfectly reasonable and shouldn't contribute that much to
>> the overhead (unless you're being crazy and deduplicating thousands of tiny
>> blocks of data).
>
> Why is deduplicating thousands of blocks of data crazy?  I already
> deduplicate four orders of magnitude more than that per week.
You missed the 'tiny' quantifier.  I'm talking really small blocks, on 
the order of less than 64k (so, IOW, stuff that's not much bigger than a 
few filesystem blocks), and that is somewhat crazy because it ends up 
not only taking _really_ long to do compared to larger chunks (because 
you're running more independent hashes than with bigger blocks), but 
also because it will often split extents unnecessarily and contribute to 
fragmentation, which will lead to all kinds of other performance 
problems on the FS.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-15 12:26                 ` Austin S. Hemmelgarn
@ 2016-11-15 17:52                   ` Zygo Blaxell
  2016-11-16 22:24                     ` Niccolò Belli
  0 siblings, 1 reply; 42+ messages in thread
From: Zygo Blaxell @ 2016-11-15 17:52 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: James Pharaoh, Mark Fasheh, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3732 bytes --]

On Tue, Nov 15, 2016 at 07:26:53AM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 16:10, Zygo Blaxell wrote:
> >Why is deduplicating thousands of blocks of data crazy?  I already
> >deduplicate four orders of magnitude more than that per week.
> You missed the 'tiny' quantifier.  I'm talking really small blocks, on the
> order of less than 64k (so, IOW, stuff that's not much bigger than a few
> filesystem blocks), and that is somewhat crazy because it ends up not only
> taking _really_ long to do compared to larger chunks (because you're running
> more independent hashes than with bigger blocks), but also because it will
> often split extents unnecessarily and contribute to fragmentation, which
> will lead to all kinds of other performance problems on the FS.

Like I said, millions of extents per week...

64K is an enormous dedup block size, especially if it comes with a 64K
alignment constraint as well.

These are the top ten duplicate block sizes from a sample of 95251
dedup ops on a medium-sized production server with 4TB of filesystem
(about one machine-day of data):

        total bytes     extent count    dup size
        2750808064      20987           131072
        803733504       1533            524288
        123801600       975             126976
        103575552       8429            12288
        97443840        793             122880
        82051072        10016           8192
        77492224        18919           4096
        71331840        645             110592
        64143360        540             118784
        63897600        650             98304

        all bytes       all extents     average dup size
        6129995776      95251           64356

128K and 512K are the most common sizes due to btrfs compression (it
limits the block size to 128K for compressed extents and seems to limit
uncompressed extents to 512K for some reason).  12K is #4, and 3 of the
top ten sizes are below 16K.  The average size is just a little below 64K.

These are the duplicates with block sizes smaller than 64K:

        total bytes     extent count    extent size
        41615360        635             65536
        46264320        753             61440
        45817856        799             57344
        41267200        775             53248 
        45760512        931             49152
        46948352        1042            45056
        43417600        1060            40960
        47296512        1283            36864
        59277312        1809            32768
        49029120        1710            28672
        43745280        1780            24576
        53616640        2618            20480
        43466752        2653            16384
        103575552       8429            12288
        82051072        10016           8192 
        77492224        18919           4096 

        all bytes <=64K extents <=64K   average dup size <=64K
        870641664       55212           15769

14% of my duplicate bytes are in blocks smaller than 64K or blocks not
aligned to a 64K boundary within a file.  It's too large a space saving
to ignore on machines that have constrained storage.

It may be worthwhile skipping 4K and 8K dedups--at 250 ms per dedup,
they're 30% of the total run time and only 2.6% of the total dedup bytes.
On the other hand, this machine is already deduping everything fast enough
to keep up with new data, so there's no performance problem to solve here.
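The quoted percentages follow directly from the tables above (a quick check, assuming a flat 250 ms per op so the runtime share equals the op-count share):

```python
total_bytes, total_extents = 6129995776, 95251
small_bytes = 870641664                    # duplicate bytes in <=64K blocks
tiny = {4096: (77492224, 18919), 8192: (82051072, 10016)}

# 14% of duplicate bytes are in sub-64K (or unaligned) blocks:
print(round(100 * small_bytes / total_bytes))    # -> 14

# 4K and 8K dedups: ~2.6% of the bytes but ~30% of the ops (and hence
# of the runtime, at a flat 250 ms per op):
tiny_bytes = sum(b for b, _ in tiny.values())
tiny_ops = sum(n for _, n in tiny.values())
print(round(100 * tiny_bytes / total_bytes, 1))  # -> 2.6
print(round(100 * tiny_ops / total_extents))     # -> 30
```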


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-15 17:52                   ` Zygo Blaxell
@ 2016-11-16 22:24                     ` Niccolò Belli
  2016-11-17  3:01                       ` Zygo Blaxell
  0 siblings, 1 reply; 42+ messages in thread
From: Niccolò Belli @ 2016-11-16 22:24 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Austin S. Hemmelgarn, James Pharaoh, Mark Fasheh, linux-btrfs

On martedì 15 novembre 2016 18:52:01 CET, Zygo Blaxell wrote:
> Like I said, millions of extents per week...
>
> 64K is an enormous dedup block size, especially if it comes with a 64K
> alignment constraint as well.
>
> These are the top ten duplicate block sizes from a sample of 95251
> dedup ops on a medium-sized production server with 4TB of filesystem
> (about one machine-day of data):

Which software do you use to dedupe your data? I tried duperemove but it 
gets killed by the OOM killer because it triggers some kind of memory leak: 
https://github.com/markfasheh/duperemove/issues/163

Niccolò Belli

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-16 22:24                     ` Niccolò Belli
@ 2016-11-17  3:01                       ` Zygo Blaxell
  2016-11-18 10:36                         ` Niccolò Belli
  0 siblings, 1 reply; 42+ messages in thread
From: Zygo Blaxell @ 2016-11-17  3:01 UTC (permalink / raw)
  To: Niccolò Belli
  Cc: Austin S. Hemmelgarn, James Pharaoh, Mark Fasheh, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1766 bytes --]

On Wed, Nov 16, 2016 at 11:24:33PM +0100, Niccolò Belli wrote:
> On martedì 15 novembre 2016 18:52:01 CET, Zygo Blaxell wrote:
> >Like I said, millions of extents per week...
> >
> >64K is an enormous dedup block size, especially if it comes with a 64K
> >alignment constraint as well.
> >
> >These are the top ten duplicate block sizes from a sample of 95251
> >dedup ops on a medium-sized production server with 4TB of filesystem
> >(about one machine-day of data):
> 
> Which software do you use to dedupe your data? I tried duperemove but it
> gets killed by the OOM killer because it triggers some kind of memory leak:
> https://github.com/markfasheh/duperemove/issues/163

Duperemove does use a lot of memory, but the logs at that URL only show
2G of RAM in duperemove--not nearly enough to trigger OOM under normal
conditions on an 8G machine.  There's another process with 6G of virtual
address space (although much less than that resident) that looks more
interesting (i.e. duperemove might just be the victim of some interaction
between baloo_file and the OOM killer).

On the other hand, the logs also show kernel 4.8.  100% of my test
machines failed to finish booting before they were cut down by OOM on
4.7.x kernels.  The same problem occurs on early kernels in the 4.8.x
series.  I am having good results with 4.8.6 and later, but you should
be aware that significant changes have been made to the way OOM works
in these kernel versions, and maybe you're hitting a regression for your
use case.
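
One way to spot that kind of bystander process before kicking off a dedup
run is to snapshot the biggest address-space users first. A minimal sketch
(Linux-only, since it reads /proc; `ps -eo vsz,rss,comm --sort=-vsz | head`
gives the same picture):

```python
import os
import re

def top_vsz(n=5):
    """Return the n processes with the largest virtual address space,
    read from /proc/<pid>/status (Linux only)."""
    procs = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                status = f.read()
        except OSError:
            continue  # process exited while we were scanning
        name = re.search(r"^Name:\s+(\S+)", status, re.M)
        vsz = re.search(r"^VmSize:\s+(\d+) kB", status, re.M)
        if name and vsz:  # kernel threads have no VmSize and are skipped
            procs.append((int(vsz.group(1)), name.group(1), pid))
    return sorted(procs, reverse=True)[:n]

for vsz_kb, name, pid in top_vsz():
    print(f"{name:16s} pid={pid:>6s} vsz={vsz_kb / 1024:8.1f} MiB")
```

Running that before and during a duperemove run makes it easy to see
whether duperemove itself is growing or whether something like baloo_file
is the real consumer.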

> Niccolò Belli
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Announcing btrfs-dedupe
  2016-11-17  3:01                       ` Zygo Blaxell
@ 2016-11-18 10:36                         ` Niccolò Belli
  0 siblings, 0 replies; 42+ messages in thread
From: Niccolò Belli @ 2016-11-18 10:36 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Austin S. Hemmelgarn, James Pharaoh, Mark Fasheh, linux-btrfs

On giovedì 17 novembre 2016 04:01:52 CET, Zygo Blaxell wrote:
> Duperemove does use a lot of memory, but the logs at that URL only show
> 2G of RAM in duperemove--not nearly enough to trigger OOM under normal
> conditions on an 8G machine.  There's another process with 6G of virtual
> address space (although much less than that resident) that looks more
> interesting (i.e. duperemove might just be the victim of some interaction
> between baloo_file and the OOM killer).

Thanks, I killed baloo_file before starting duperemove and it somehow 
improved (it reached 99.73% before getting killed by the OOM killer once 
again):

[ 6342.147251] Purging GPU memory, 0 pages freed, 18268 pages still pinned.
[ 6342.147253] 48 and 0 pages still available in the bound and unbound GPU 
page lists.
[ 6342.147340] Xorg invoked oom-killer: 
gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3, 
oom_score_adj=0
[ 6342.147341] Xorg cpuset=/ mems_allowed=0
[ 6342.147346] CPU: 3 PID: 650 Comm: Xorg Not tainted 4.8.8-2-ARCH #1
[ 6342.147347] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A09 
08/29/2016
[ 6342.147348]  0000000000000286 000000009b89a9c8 ffff88020752f598 
ffffffff812fde10
[ 6342.147351]  ffff88020752f758 ffff8801edc62ac0 ffff88020752f608 
ffffffff81205fa2
[ 6342.147353]  000188020752f5a0 000000009b89a9c8 00000000ffffffff 
0000000000000000
[ 6342.147356] Call Trace:
[ 6342.147361]  [<ffffffff812fde10>] dump_stack+0x63/0x83
[ 6342.147364]  [<ffffffff81205fa2>] dump_header+0x5c/0x1ea
[ 6342.147366]  [<ffffffff8117ca35>] oom_kill_process+0x265/0x410
[ 6342.147368]  [<ffffffff810864c7>] ? has_capability_noaudit+0x17/0x20
[ 6342.147369]  [<ffffffff8117cfb0>] out_of_memory+0x380/0x420
[ 6342.147373]  [<ffffffff81312c48>] ? find_next_bit+0x18/0x20
[ 6342.147374]  [<ffffffff81182610>] __alloc_pages_nodemask+0xda0/0xde0
[ 6342.147377]  [<ffffffff811d6605>] alloc_pages_current+0x95/0x140
[ 6342.147380]  [<ffffffff811a3e6e>] kmalloc_order_trace+0x2e/0xf0
[ 6342.147382]  [<ffffffff811e1f4a>] __kmalloc+0x1ea/0x200
[ 6342.147397]  [<ffffffffa05f825e>] ? alloc_gen8_temp_bitmaps+0x2e/0x80 
[i915]
[ 6342.147407]  [<ffffffffa05f8277>] alloc_gen8_temp_bitmaps+0x47/0x80 
[i915]
[ 6342.147417]  [<ffffffffa05fb648>] gen8_alloc_va_range_3lvl+0x98/0x9c0 
[i915]
[ 6342.147419]  [<ffffffff81197f3d>] ? shmem_getpage_gfp+0xed/0xc30
[ 6342.147421]  [<ffffffff8130e3ba>] ? sg_init_table+0x1a/0x40
[ 6342.147423]  [<ffffffff813292e3>] ? swiotlb_map_sg_attrs+0x53/0x130
[ 6342.147432]  [<ffffffffa05fc1c6>] gen8_alloc_va_range+0x256/0x490 [i915]
[ 6342.147442]  [<ffffffffa05fea7b>] i915_vma_bind+0x9b/0x190 [i915]
[ 6342.147453]  [<ffffffffa060522b>] i915_gem_object_do_pin+0x86b/0xa90 
[i915]
[ 6342.147463]  [<ffffffffa060547d>] i915_gem_object_pin+0x2d/0x30 [i915]
[ 6342.147472]  [<ffffffffa05f316f>] 
i915_gem_execbuffer_reserve_vma.isra.7+0x9f/0x180 [i915]
[ 6342.147482]  [<ffffffffa05f35e6>] 
i915_gem_execbuffer_reserve.isra.8+0x396/0x3c0 [i915]
[ 6342.147491]  [<ffffffffa05f457b>] 
i915_gem_do_execbuffer.isra.14+0x68b/0x1270 [i915]
[ 6342.147493]  [<ffffffff8158ad31>] ? unix_stream_read_generic+0x281/0x8a0
[ 6342.147503]  [<ffffffffa05f5dc4>] i915_gem_execbuffer2+0x104/0x270 
[i915]
[ 6342.147509]  [<ffffffffa046cd40>] drm_ioctl+0x200/0x4f0 [drm]
[ 6342.147518]  [<ffffffffa05f5cc0>] ? i915_gem_execbuffer+0x330/0x330 
[i915]
[ 6342.147520]  [<ffffffff810ecc1d>] ? enqueue_hrtimer+0x3d/0xa0
[ 6342.147522]  [<ffffffff813061f4>] ? timerqueue_del+0x24/0x70
[ 6342.147523]  [<ffffffff810ececc>] ? __remove_hrtimer+0x3c/0x90
[ 6342.147525]  [<ffffffff8121c383>] do_vfs_ioctl+0xa3/0x5f0
[ 6342.147527]  [<ffffffff810ee65b>] ? do_setitimer+0x12b/0x230
[ 6342.147529]  [<ffffffff812275f7>] ? __fget+0x77/0xb0
[ 6342.147531]  [<ffffffff8121c949>] SyS_ioctl+0x79/0x90
[ 6342.147533]  [<ffffffff815f7e72>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[ 6342.147535] Mem-Info:
[ 6342.147538] active_anon:76311 inactive_anon:76782 isolated_anon:0
                active_file:347581 inactive_file:1415592 isolated_file:64
                unevictable:8 dirty:482 writeback:0 unstable:0
                slab_reclaimable:27219 slab_unreclaimable:14772
                mapped:20714 shmem:30458 pagetables:10557 bounce:0
                free:25642 free_pcp:327 free_cma:0
[ 6342.147541] Node 0 active_anon:305244kB inactive_anon:307128kB 
active_file:1390324kB inactive_file:5662368kB unevictable:32kB 
isolated(anon):0kB isolated(file):256kB mapped:82856kB dirty:1928kB 
writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 81920kB anon_thp: 
121832kB writeback_tmp:0kB unstable:0kB pages_scanned:32 all_unreclaimable? 
no
[ 6342.147542] Node 0 DMA free:15688kB min:132kB low:164kB high:196kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15984kB managed:15896kB 
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:208kB kernel_stack:0kB 
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 6342.147545] lowmem_reserve[]: 0 3395 7850 7850 7850
[ 6342.147548] Node 0 DMA32 free:48772kB min:29172kB low:36464kB 
high:43756kB active_anon:84724kB inactive_anon:87164kB active_file:555728kB 
inactive_file:2639796kB unevictable:0kB writepending:696kB 
present:3564504kB managed:3488752kB mlocked:0kB slab_reclaimable:47196kB 
slab_unreclaimable:11472kB kernel_stack:192kB pagetables:200kB bounce:0kB 
free_pcp:1284kB local_pcp:0kB free_cma:0kB
[ 6342.147553] lowmem_reserve[]: 0 0 4454 4454 4454
[ 6342.147555] Node 0 Normal free:38108kB min:38276kB low:47844kB 
high:57412kB active_anon:220520kB inactive_anon:219964kB 
active_file:834596kB inactive_file:3022312kB unevictable:32kB 
writepending:1232kB present:4694016kB managed:4561616kB mlocked:32kB 
slab_reclaimable:61680kB slab_unreclaimable:47408kB kernel_stack:7776kB 
pagetables:42028kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 6342.147558] lowmem_reserve[]: 0 0 0 0 0
[ 6342.147561] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 1*64kB (U) 2*128kB 
(U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15688kB
[ 6342.147570] Node 0 DMA32: 12108*4kB (UE) 29*8kB (UE) 0*16kB 0*32kB 
0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 48664kB
[ 6342.147578] Node 0 Normal: 9514*4kB (UMH) 47*8kB (H) 0*16kB 0*32kB 
0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 38432kB
[ 6342.147586] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 
hugepages_size=1048576kB
[ 6342.147588] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 
hugepages_size=2048kB
[ 6342.147589] 1796570 total pagecache pages
[ 6342.147591] 2853 pages in swap cache
[ 6342.147592] Swap cache stats: add 378584, delete 375731, find 
101539/126272
[ 6342.147592] Free swap  = 7743900kB
[ 6342.147593] Total swap = 8387904kB
[ 6342.147594] 2068626 pages RAM
[ 6342.147594] 0 pages HighMem/MovableOnly
[ 6342.147595] 52060 pages reserved
[ 6342.147595] 0 pages hwpoisoned
[ 6342.147596] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds 
swapents oom_score_adj name
[ 6342.147600] [  244]     0   244    21377     2220      37       4       
87             0 systemd-journal
[ 6342.147603] [  278]     0   278     9122       33      20       3      
334         -1000 systemd-udevd
[ 6342.147605] [  597]   192   597    35700       41      39       3      
146             0 systemd-timesyn
[ 6342.147607] [  601]    84   601    10328      158      24       3       
74             0 avahi-daemon
[ 6342.147609] [  603]     0   603     1841        3       9       3       
21             0 gpm
[ 6342.147610] [  604]     0   604     9294       32      21       3      
140             0 bluetoothd
[ 6342.147612] [  606]    84   606    10295        0      23       3       
78             0 avahi-daemon
[ 6342.147614] [  607]    81   607     8584      716      21       3       
95          -900 dbus-daemon
[ 6342.147615] [  609]     0   609     9635      373      23       3      
114             0 systemd-logind
[ 6342.147617] [  610]     0   610    90299     1407      74       3     
1048             0 NetworkManager
[ 6342.147619] [  629]     0   629    22126      369      46       3      
462             0 cupsd
[ 6342.147621] [  630]     0   630    10100        0      24       3      
143         -1000 sshd
[ 6342.147622] [  631]     0   631   186671       75     145       4     
1675             0 libvirtd
[ 6342.147624] [  633]     0   633    47157      132      44       3      
345             0 sddm
[ 6342.147626] [  650]     0   650    80478    10316     161       3     
2042             0 Xorg
[ 6342.147628] [  651]   124   651    78876       73      53       3     
2031             0 colord
[ 6342.147630] [  663]     0   663    79429      432     150       4      
621             0 smbd
[ 6342.147631] [  665]     0   665    78446        9     144       4      
643             0 smbd-notifyd
[ 6342.147634] [  666]     0   666    78444       10     142       4      
642             0 cleanupd
[ 6342.147635] [  669]     0   669    79429      664     145       4      
598             0 lpqd
[ 6342.147637] [  681]     0   681    10855        0      25       3      
114             0 wpa_supplicant
[ 6342.147639] [  689]   102   689   131227      585      53       3     
2366             0 polkitd
[ 6342.147640] [  777]    99   777    11640        0      26       3      
134             0 dnsmasq
[ 6342.147642] [  778]     0   778    11640        0      25       3      
132             0 dnsmasq
[ 6342.147644] [  838]    99   838    11640      108      24       3      
130             0 dnsmasq
[ 6342.147645] [  839]     0   839    11607        0      24       3       
90             0 dnsmasq
[ 6342.147647] [  844]     0   844    93062      666      46       3      
248             0 udisksd
[ 6342.147648] [  850]     0   850    75056      510      46       3      
285             0 upowerd
[ 6342.147650] [  883]     0   883    40487      226      69       3      
397             0 sddm-helper
[ 6342.147651] [  884]  1000   884    13702      505      31       3      
206             0 systemd
[ 6342.147653] [  885]  1000   885    24689       15      49       3      
428             0 (sd-pam)
[ 6342.147655] [  894]  1000   894     3425       51      11       3      
114             0 startkde
[ 6342.147656] [  901]  1000   901     8539       69      21       3      
419             0 dbus-daemon
[ 6342.147658] [  935]  1000   935     1526        0       8       3       
21             0 start_kdeinit
[ 6342.147659] [  936]  1000   936    75329      236     126       4      
762             0 kdeinit5
[ 6342.147661] [  937]  1000   937   129068       50     169       3      
942             0 klauncher
[ 6342.147662] [  940]  1000   940   367750     2255     318       5     
2577             0 kded5
[ 6342.147664] [  948]  1000   948   132228      123     168       4     
1562             0 kaccess
[ 6342.147666] [  962]  1000   962   136114      850      63       4      
439             0 mission-control
[ 6342.147667] [  967]  1000   967   129490      142     162       4     
1190             0 kglobalaccel5
[ 6342.147669] [  973]  1000   973    46251       73      26       3      
135             0 dconf-service
[ 6342.147670] [  980]  1000   980    21602       62      32       4      
145             0 kwrapper5
[ 6342.147672] [  981]  1000   981   159329      271     188       3     
1237             0 ksmserver
[ 6342.147673] [  991]  1000   991    89508      221      99       3      
502             0 kscreen_backend
[ 6342.147675] [  993]  1000   993   780636      196     268       5     
6207             0 kwin_x11
[ 6342.147677] [  996]  1000   996   164381      573     203       4     
1179             0 kdeconnectd
[ 6342.147678] [  997]  1000   997   190138      156     242       4     
4271             0 krunner
[ 6342.147680] [  999]  1000   999   928816    13708     461       5    
17112             0 plasmashell
[ 6342.147682] [ 1000]  1000  1000   172107      144     179       4     
1034             0 polkit-kde-auth
[ 6342.147683] [ 1001]  1000  1001   129793       58     167       3      
934             0 xembedsniproxy
[ 6342.147685] [ 1006]  1000  1006   127736      476     167       3      
860             0 org_kde_powerde
[ 6342.147686] [ 1007]  1000  1007   267637      360     284       4     
4630             0 korgac
[ 6342.147688] [ 1077]  1000  1077   125471       67      93       4     
1316             0 pulseaudio
[ 6342.147689] [ 1079]   133  1079    44462        0      22       3       
71             0 rtkit-daemon
[ 6342.147691] [ 1093]  1000  1093   199345       71      65       4     
5741             0 xiccd
[ 6342.147692] [ 1116]  1000  1116    21171        0      43       3      
175             0 gconf-helper
[ 6342.147694] [ 1118]  1000  1118    16483       25      34       3      
198             0 gconfd-2
[ 6342.147695] [ 1123]  1000  1123    70044        0      38       3      
225             0 gvfsd
[ 6342.147697] [ 1125]  1000  1125   226815      111     183       4     
1213             0 kactivitymanage
[ 6342.147701] [ 1130]  1000  1130    86949       69      35       3      
744             0 gvfsd-fuse
[ 6342.147702] [ 1146]  1000  1146   129379      146     164       3      
948             0 kactivitymanage
[ 6342.147704] [ 1150]  1000  1150   129380      130     158       4      
956             0 kactivitymanage
[ 6342.147705] [ 1195]  1000  1195    10293       44      24       3      
149             0 obexd
[ 6342.147707] [ 1197]  1000  1197   128967      247     166       3      
944             0 akonadi_control
[ 6342.147709] [ 1206]  1000  1206   682254        0     210       6     
3962             0 akonadiserver
[ 6342.147710] [ 1211]  1000  1211   146367     5703     109       3    
19313             0 mysqld
[ 6342.147712] [ 1261]  1000  1261   172241        0     178       4     
1600             0 akonadi_akonote
[ 6342.147714] [ 1262]  1000  1262   328923        0     321       4     
4533             0 akonadi_archive
[ 6342.147715] [ 1263]  1000  1263   193336        0     187       3     
1156             0 akonadi_birthda
[ 6342.147717] [ 1264]  1000  1264   169386        6     177       4     
1586             0 akonadi_contact
[ 6342.147718] [ 1265]  1000  1265   196042      380     192       4     
1077             0 akonadi_followu
[ 6342.147719] [ 1266]  1000  1266   195484      884     195       4     
1081             0 akonadi_ical_re
[ 6342.147721] [ 1269]  1000  1269   148974      114     207       4     
2098             0 kwalletd5
[ 6342.147723] [ 1271]  1000  1271   178961        0     187       4     
1328             0 akonadi_indexin
[ 6342.147724] [ 1272]  1000  1272   172239        0     179       4     
1608             0 akonadi_maildir
[ 6342.147726] [ 1275]  1000  1275   229896      260     192       4     
1551             0 akonadi_maildis
[ 6342.147727] [ 1278]  1000  1278   356078        0     344       5     
5204             0 akonadi_mailfil
[ 6342.147729] [ 1279]  1000  1279   169395        9     179       4     
1077             0 akonadi_migrati
[ 6342.147730] [ 1281]  1000  1281   237225        0     271       4     
3373             0 akonadi_newmail
[ 6342.147731] [ 1283]  1000  1283   296254     1054     280       4     
3818             0 akonadi_notes_a
[ 6342.147733] [ 1284]  1000  1284   292138      501     312       4     
3846             0 akonadi_sendlat
[ 6342.147736] [ 1486]  1000  1486   129345       71     170       3      
977             0 kuiserver5
[ 6342.147738] [ 1575]  1000  1575   157148    10329     220       4     
2473             0 konsole
[ 6342.147739] [ 1579]  1000  1579     4021       56      13       3      
164             0 bash
[ 6342.147741] [ 1582]     0  1582    17563       23      38       3      
248             0 sudo
[ 6342.147742] [ 1583]     0  1583     3425       53      11       3      
118             0 duperemove.sh
[ 6342.147744] [ 4060]     0  4060   168501    92579     203       3       
24             0 duperemove
[ 6342.147746] Out of memory: Kill process 4060 (duperemove) score 21 or 
sacrifice child
[ 6342.147754] Killed process 4060 (duperemove) total-vm:674004kB, 
anon-rss:367672kB, file-rss:2644kB, shmem-rss:0kB



Any ideas? The process with the highest total_vm is plasmashell, but it has 
only 900MB of vm.
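
A note on reading that table: the total_vm and rss columns are counted in
4 KiB pages, not kB. The kill line above confirms it for duperemove
(168501 pages * 4 kB = 674004 kB, matching "total-vm:674004kB"). A quick
sketch with figures copied from the report:

```python
PAGE_KB = 4  # x86-64 page size in KiB; OOM-table total_vm/rss are page counts

# (total_vm, rss, name) rows taken from the OOM report above
rows = [
    (928816, 13708, "plasmashell"),
    (168501, 92579, "duperemove"),
    (146367,  5703, "mysqld"),
]

for total_vm, rss, name in rows:
    print(f"{name:12s} vsz={total_vm * PAGE_KB / 1024:7.1f} MiB  "
          f"rss={rss * PAGE_KB / 1024:6.1f} MiB")
# plasmashell comes out around 3628 MiB of virtual address space
```

So by that reading plasmashell actually holds about 3.5 GiB of virtual
address space (the ~900 MB figure comes from treating total_vm as kB),
though as noted earlier in the thread, a large virtual size alone means
little when only a small fraction of it is resident.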

Niccolò Belli

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-11-18 10:36 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-06 13:30 Announcing btrfs-dedupe James Pharaoh
2016-11-07 14:02 ` David Sterba
2016-11-07 17:48   ` Mark Fasheh
2016-11-07 20:54     ` Adam Borowski
2016-11-08  2:17       ` Darrick J. Wong
2016-11-08 18:59         ` Mark Fasheh
2016-11-08 19:47           ` Darrick J. Wong
2016-11-08 19:47             ` [Ocfs2-devel] " Darrick J. Wong
2016-11-09 15:02       ` David Sterba
2016-11-08  2:40   ` Christoph Anton Mitterer
2016-11-08  6:11     ` James Pharaoh
2016-11-08 13:26     ` Austin S. Hemmelgarn
2016-11-08 16:57       ` Darrick J. Wong
2016-11-08 17:04         ` Austin S. Hemmelgarn
2016-11-08 18:49     ` Mark Fasheh
2016-11-07 17:59 ` Mark Fasheh
2016-11-07 18:49   ` James Pharaoh
2016-11-07 18:53     ` James Pharaoh
2016-11-14 18:07     ` Zygo Blaxell
2016-11-14 18:22       ` James Pharaoh
2016-11-14 18:39         ` Austin S. Hemmelgarn
2016-11-14 19:51           ` Zygo Blaxell
2016-11-14 19:56             ` Austin S. Hemmelgarn
2016-11-14 21:10               ` Zygo Blaxell
2016-11-15 12:26                 ` Austin S. Hemmelgarn
2016-11-15 17:52                   ` Zygo Blaxell
2016-11-16 22:24                     ` Niccolò Belli
2016-11-17  3:01                       ` Zygo Blaxell
2016-11-18 10:36                         ` Niccolò Belli
2016-11-14 20:07             ` James Pharaoh
2016-11-14 21:22               ` Zygo Blaxell
2016-11-14 18:43         ` Zygo Blaxell
2016-11-08 11:06 ` Niccolò Belli
2016-11-08 11:38   ` James Pharaoh
2016-11-08 16:57     ` Niccolò Belli
2016-11-08 16:58       ` James Pharaoh
2016-11-08 17:08         ` Niccolò Belli
2016-11-14 18:27   ` Zygo Blaxell
2016-11-08 22:36 ` Saint Germain
2016-11-09 11:24   ` Niccolò Belli
2016-11-09 12:47     ` Saint Germain
2016-11-13 12:45   ` James Pharaoh
