All of lore.kernel.org
* Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
@ 2016-01-16 12:27 Al
  2016-01-16 14:10 ` Duncan
  2016-01-18  1:36 ` Qu Wenruo
  0 siblings, 2 replies; 27+ messages in thread
From: Al @ 2016-01-16 12:27 UTC (permalink / raw)
  To: linux-btrfs

Hi,

This must be a silly question! Please assume that I know not much more than
nothing about fs.
I know dedup traditionally costs a lot of memory, but I don't really
understand why it is done like that. Let me explain my question:

AFAICT dedup matches file-level chunks (or whatever you call them) using a
hash function or something which has limited collision potential. The hash
is used to match blocks as they are committed to disk (I'm talking online
dedup*), reflinking/eliminating the duplicated blocks as necessary.  This
bloody great hash tree is kept in memory for speed of lookup (I assume).

But why?

Is there any urgency for dedup? What's wrong with storing the hash on disk
with the block and having a separate process dedup the written data over
time? Dedup'ing high-write-count data immediately on write is
counterproductive, because no sooner has it been deduped than it is rendered
obsolete by another COW write.

There's also the problem of opening a potential problem window before the
commit to disk (hopefully covered by the journal) whilst we seek the
relevant duplicate, if there is one.

Help me out peeps? Why is there such an urgency to have online dedup,
rather than a triggered/delayed dedup, similar to the current autodefrag process?

Thank you. I'm sure the answer is obvious, but not to me!

* dedup/dedupe/deduplication 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-16 12:27 Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls Al
@ 2016-01-16 14:10 ` Duncan
  2016-01-16 18:07   ` Rich Freeman
  2016-01-20 14:43   ` Al
  2016-01-18  1:36 ` Qu Wenruo
  1 sibling, 2 replies; 27+ messages in thread
From: Duncan @ 2016-01-16 14:10 UTC (permalink / raw)
  To: linux-btrfs

Al posted on Sat, 16 Jan 2016 12:27:16 +0000 as excerpted:

> This must be a silly question! Please assume that I know not much more
> than nothing about fs.
> I know dedup traditionally costs a lot of memory, but I don't really
> understand why it is done like that. Let me explain my question:
> 
> AFAICT dedup matches file level chunks (or whatever you call them) using
> a hash function or something which has limited collision potential. The
> hash is used to match blocks as they are committed to disk, I'm talking
> online dedup*, and reflink/eliminate the duplicated blocks as necessary.
>  This bloody great hash tree is saved in memory for speed of lookup (I
> assume).
> 
> But why?
> 
> Is there any urgency for dedup? What's wrong with storing the hash on
> disk with the block and having a separate process dedup the written data
> over time? Dedup'ing high-write-count data immediately on write is
> counterproductive, because no sooner has it been deduped than it is
> rendered obsolete by another COW write.
> 
> There's also the problem of opening a potential problem window before
> the commit to disk, hopefully covered by the journal, whilst we seek the
> relevant duplicate if there is one.
> 
> Help me out peeps? Why is there such an urgency to have online dedup,
> rather than a triggered/delayed dedup, similar to the current autodefrag
> process?
> 
> Thank you. I'm sure the answer is obvious, but not to me!
> 
> * dedup/dedupe/deduplication

There are actually uses for both inline and out-of-line[1] aka delayed 
dedup.  Btrfs already has a number of independent products doing various 
forms of out-of-line dedup, so what's missing and being developed now is 
the inline dedup option, which, being directly in the write processing, 
must be handled by btrfs itself -- it can't be primarily done by third 
parties with just a few kernel calls, like out-of-line dedup can.

Meanwhile, the inline dedup implementation being considered for mainline 
is itself built on two previously available implementations, developed 
more or less independently with different goals in mind, with the planned 
mainline implementation sharing what it can between the two but still 
giving the user the choice of which one to actually run.

One uses the in-memory hash functionality much as you described.  
This one should be faster, but will require more memory to store the 
hashes, and will miss some dedup opportunities simply because it doesn't 
have them hashed when the write request comes.

The other one will store its hashes on a block-device[2], making it slower, 
but also allowing it higher-capacity hash storage, which, being on a 
block-device, will normally survive reboots and simple umount/mount 
cycles, thus deduplicating far more efficiently, if at the expense of 
speed.

But because both of these are inline implementations, they compare 
incoming writes to what they already have hashed, and thus don't take two 
filenames to compare and dedup if possible.  That functionality is thus 
left for out-of-line dedup methods, if desired.  Particularly if one is 
using the inline in-memory variant, they may well want to follow up with 
out-of-line dedup runs at a later time, in order to catch what the fast 
but not particularly efficient inline in-memory dedup missed.

Make more sense now? =:^)

---
[1] I prefer the terms inline and out-of-line to online/offline, since 
the filesystem is still online when they run, making the term offline 
confusing: it doesn't mean "offline" in the sense that offline is meant 
for fsck, for instance.

[2] On block-device:  I'm trying to get out of the habit of referring to 
disks, as that sounds rather anachronistic when it could just as easily 
be an SSD, having nothing to do with actual spinning disks.  So I'll 
normally say simply device, or storage device, except that here it could 
be confused with a memory device, which is the other option, so I call it 
a block-device.
-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-16 14:10 ` Duncan
@ 2016-01-16 18:07   ` Rich Freeman
  2016-01-18 12:23     ` Austin S. Hemmelgarn
  2016-01-20 14:49     ` Al
  2016-01-20 14:43   ` Al
  1 sibling, 2 replies; 27+ messages in thread
From: Rich Freeman @ 2016-01-16 18:07 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Sat, Jan 16, 2016 at 9:10 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Al posted on Sat, 16 Jan 2016 12:27:16 +0000 as excerpted:
>
>> Is there any urgency for dedup? What's wrong with storing the hash on
>> disk with the block and having a separate process dedup the written data
>> over time;
>
> There's actually uses for both inline and out-of-line[1] aka delayed
> dedup.  Btrfs already has a number of independent products doing various
> forms of out-of-line dedup, so what's missing and being developed now is
> the inline dedup option, which being directly in the write processing,
> must be handled by btrfs itself -- it can't be primarily done by third
> parties with just a few kernel calls, like out-of-line dedup can.

Does the out-of-line dedup option actually utilize stored hashes, or
is it forced to re-read all the data to compute hashes?  If it is
collecting checksums etc., is this done efficiently?

I think he is actually suggesting a hybrid approach where a bit of
effort is done during operations to greatly streamline out-of-line
deduplication.  I'm not sure how close we are to that already, or if
any room for improvement remains.

--
Rich


* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-16 12:27 Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls Al
  2016-01-16 14:10 ` Duncan
@ 2016-01-18  1:36 ` Qu Wenruo
  2016-01-18  3:10   ` Duncan
  2016-01-20 14:53   ` Al
  1 sibling, 2 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-01-18  1:36 UTC (permalink / raw)
  To: Al, linux-btrfs



Al wrote on 2016/01/16 12:27 +0000:
> Hi,
>
> This must be a silly question! Please assume that I know not much more than
> nothing about fs.
> I know dedup traditionally costs a lot of memory, but I don't really
> understand why it is done like that. Let me explain my question:

As one of the authors of the recent btrfs inband dedup patches: at least 
in my code, dedup doesn't cost a lot of memory, unless a stupid user 
gives a stupid memory limit for the in-memory backend.

And for the on-disk backend, the memory pressure is even smaller.
The kernel can trigger a transaction commit to reclaim the page caches 
used for dedup.
So I don't see what's wrong with the memory usage.

>
> AFAICT dedup matches file level chunks (or whatever you call them) using a
> hash function or something which has limited collision potential.

The more accurate term would be "file extent", though.

> The hash
> is used to match blocks as they are committed to disk, I'm talking online
> dedup*, and reflink/eliminate the duplicated blocks as necessary.  This
> bloody great hash tree is saved in memory for speed of lookup (I assume).

No, you can choose whether to store it in memory or on disk.
That is one of the selling points of the patchset I recently submitted.
Before this, you either used Liu Bo's on-disk one, or my early pure 
in-memory one.

And unfortunately (or in fact fortunately?), the size of a hash is already 
quite small.

The current dedup unit (although I use the term "dedup blocksize") is 16K, 
which means only writes larger than 16K will go through inband dedup.
So, 16K of data = one hash = 112 bytes.
For 1G of data, that's just about 7M.
(BTW, 1G of data means 1M of CRC32 checksums, although they can be stored 
on disk, just like what we do in the on-disk backend.)

And the dedup blocksize can be tuned from 4K to 8M.
If using an 8M dedup blocksize, 1G of data only takes about 14K of memory.
Much smaller than the btrfs CRC32 checksums.

Not to mention there is a memory usage limit, and there is also the 
on-disk backend.
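A quick back-of-envelope check of those figures (the 112-bytes-per-hash number is taken from this mail; the real per-entry overhead may differ):

```shell
# Hash bookkeeping needed per 1 GiB of data, at the quoted 112 bytes/hash.
data=$((1 << 30))               # 1 GiB of file data
per_hash=112                    # bytes per stored hash entry (quoted above)

for bs in $((16 << 10)) $((8 << 20)); do   # 16K and 8M dedup blocksizes
    entries=$((data / bs))
    printf 'blocksize %7d: %5d hashes, %7d bytes total\n' \
        "$bs" "$entries" $((entries * per_hash))
done
```

At 16K that works out to 7,340,032 bytes, the "about 7M" above; at 8M it is 14,336 bytes, the "about 14K".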

>
> But why?
>
> Is there any urgency for dedup? What's wrong with storing the hash on disk
> with the block and having a separate process dedup the written data over
> time;

And that's almost what the on-disk backend is doing.

> Dedup'ing high-write-count data immediately on write is
> counterproductive, because no sooner has it been deduped than it is
> rendered obsolete by another COW write.

It seems that you are not familiar with how the kernel caches data for 
filesystems.
There is already the kernel page cache for such cases.
No matter how many times you write, as long as you're doing buffered 
writes, the data is not written to disk but cached by the kernel, until 
either you trigger a manual sync or memory pressure hits a threshold.

And inband dedup doesn't happen *until* the cached data is about to be 
written to disk.
So what you're concerned about is not a problem.
No extra CPU/memory is used until you're committing data to disk.

>
> There's also the problem of opening a potential problem window before the
> commit to disk, hopefully covered by the journal, whilst we seek the
> relevant duplicate if there is one.

>
> Help me out peeps? Why is there such an urgency to have online dedup,
> rather than a triggered/delayed dedup, similar to the current autodefrag process?
>
> Thank you. I'm sure the answer is obvious, but not to me!

Although I really don't like to say things like this:
please, READ THE "FUNNY" CODE.

I used to have a lot of questions and "good" ideas about btrfs,
but as I dug into the code, the questions disappeared and the "good" 
ideas turned out to be either already done or really bad ideas.

So if you're really interested in btrfs, please start by digging into the 
on-disk format (believe me, this is the easiest way) and get familiar 
with a lot of kernel MM/VFS facilities.
(That's what I used to do and am still doing.)

Thanks,
Qu

>
> * dedup/dedupe/deduplication




* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-18  1:36 ` Qu Wenruo
@ 2016-01-18  3:10   ` Duncan
  2016-01-18  3:16     ` Qu Wenruo
  2016-01-20 14:53   ` Al
  1 sibling, 1 reply; 27+ messages in thread
From: Duncan @ 2016-01-18  3:10 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Mon, 18 Jan 2016 09:36:49 +0800 as excerpted:

>> dedup'ing data immediately when written to high-write-count data is
>> counter productive because no sooner has it been deduped then it is
>> rendered obsolete by another COW write.
> 
> And it seems that you are not familiar how kernel is caching data for
> filesystem.
> There is already kernel page cache for such case.
> No matter how many times you write, as long as you're doing buffered
> write the the data is not written to disk but cached by kernel, until
> either you triggered a manual sync or memory pressure hits threshold.

Not contradicting in general, but checking my own understanding here...

Doesn't the kernel write cache get synced by timeout as well as memory 
pressure and manual sync, with the timeouts found in
/proc/sys/vm/dirty_*_centisecs, with defaults of 5 seconds background and 
30 seconds higher priority foreground expiry?

Regardless, I agree, the kernel page-cache seriously mitigates the stated 
concerns.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-18  3:10   ` Duncan
@ 2016-01-18  3:16     ` Qu Wenruo
  2016-01-18  3:51       ` Duncan
  0 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2016-01-18  3:16 UTC (permalink / raw)
  To: Duncan, linux-btrfs



Duncan wrote on 2016/01/18 03:10 +0000:
> Qu Wenruo posted on Mon, 18 Jan 2016 09:36:49 +0800 as excerpted:
>
>>> dedup'ing data immediately when written to high-write-count data is
>>> counter productive because no sooner has it been deduped then it is
>>> rendered obsolete by another COW write.
>>
>> And it seems that you are not familiar how kernel is caching data for
>> filesystem.
>> There is already kernel page cache for such case.
>> No matter how many times you write, as long as you're doing buffered
>> write the the data is not written to disk but cached by kernel, until
>> either you triggered a manual sync or memory pressure hits threshold.
>
> Not contradicting in general, but checking my own understanding here...
>
> Doesn't the kernel write cache get synced by timeout as well as memory
> pressure and manual sync, with the timeouts found in
> /proc/sys/vm/dirty_*_centisecs, with defaults of 5 seconds background and
> 30 seconds higher priority foreground expiry?
>
> Regardless, I agree, the kernel page-cache seriously mitigates the stated
> concerns.
>
Yep, I forgot the timeout. It can also be specified by the per-fs mount 
option "commit=".

But I have never used the /proc/sys/vm/dirty_* interface before... I'd 
better check the code or add some debug pr_info() to learn such behavior.

Thanks for pointing this out,
Qu




* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-18  3:16     ` Qu Wenruo
@ 2016-01-18  3:51       ` Duncan
  2016-01-18 12:48         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 27+ messages in thread
From: Duncan @ 2016-01-18  3:51 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Mon, 18 Jan 2016 11:16:11 +0800 as excerpted:

> Duncan wrote on 2016/01/18 03:10 +0000:
>>
>> Doesn't the kernel write cache get synced by timeout as well as
>> memory pressure and manual sync, with the timeouts found in
>> /proc/sys/vm/dirty_*_centisecs, with defaults of 5 seconds
>> background and 30 seconds higher priority foreground expiry?
>>
> Yep, I forgot timeout. It can also be specified by per fs mount
> option "commit=".
> 
> But I never /proc/sys/vm/dirty_* interface before... I'd better
> check the code or add some debug pr_info to learn such behavior.

Checking my understanding a bit more, since you brought up the
btrfs "commit=" mount option.

I knew about the option previously, and obviously knew it worked in the 
same context as the page-cache stuff, but in my understanding the btrfs 
"commit=" mount option operates at the filesystem layer, not the general 
filesystem-vm layer controlled by /proc/sys/vm/dirty_*.  In my 
understanding, therefore, the two timeouts could effectively be added, 
yielding a maximum 1 minute (30 seconds btrfs default commit time plus 30 
seconds vm expiry) commit time.

But that has always been an unverified, fuzzy assumption on my part.  The 
two times could be at the same layer, with the btrfs mount option being a 
per-filesystem method of controlling the same thing that /proc/sys/vm/
dirty_expire_centisecs controls globally (as you seemed to imply above), 
or the two could be different layers but with the countdown times 
overlapping, both of which would result in a 30-second total timeout, 
instead of the 30+30=60 that I had assumed.

And while we're at it, how does /proc/sys/vm/vfs_cache_pressure play into 
all this?  I know the dirty_* and how the dirty_*bytes vs. dirty_*ratio 
vs. dirty_*centisecs thing works, but don't quite understand how 
vfs_cache_pressure fits in with dirty_*.

Of course if there's already a good writeup on the dirty_* vs 
vfs_cache_pressure question somewhere, a link would be fine.  But I doubt 
there's good info on how the btrfs commit= mount option fits into it all, 
as the btrfs option is relatively new, and I'd likely have seen that 
already, if it were out there.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-16 18:07   ` Rich Freeman
@ 2016-01-18 12:23     ` Austin S. Hemmelgarn
  2016-01-23 22:22       ` Mark Fasheh
  2016-01-20 14:49     ` Al
  1 sibling, 1 reply; 27+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-18 12:23 UTC (permalink / raw)
  To: Rich Freeman, Duncan; +Cc: Btrfs BTRFS

On 2016-01-16 13:07, Rich Freeman wrote:
> On Sat, Jan 16, 2016 at 9:10 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Al posted on Sat, 16 Jan 2016 12:27:16 +0000 as excerpted:
>>
>>> Is there any urgency for dedup? What's wrong with storing the hash on
>>> disk with the block and having a separate process dedup the written data
>>> over time;
>>
>> There's actually uses for both inline and out-of-line[1] aka delayed
>> dedup.  Btrfs already has a number of independent products doing various
>> forms of out-of-line dedup, so what's missing and being developed now is
>> the inline dedup option, which being directly in the write processing,
>> must be handled by btrfs itself -- it can't be primarily done by third
>> parties with just a few kernel calls, like out-of-line dedup can.
>
> Does the out-of-line dedup option actually utilize stored hashes, or
> is it forced to re-read all the data to compute hashes?  If it is
> collecting checksums/etc is this done efficiently?
AFAIK, duperemove has the option to store block hashes in a database to 
save them between runs (I'm pretty sure that it invalidates hashes if 
the file containing the block changed, but I'm not certain).
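For example, a run that keeps its hashes between invocations might look like this (option names are from duperemove's documentation and should be checked against your installed version; the paths are made up):

```shell
# -d actually submits the dedupe requests, -r recurses into directories;
# --hashfile persists block hashes so unchanged files need not be
# re-read and re-hashed on the next run.
duperemove -dr --hashfile=/var/cache/duperemove.db /srv/data
```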
>
> I think he is actually suggesting a hybrid approach where a bit of
> effort is done during operations to greatly streamline out-of-line
> deduplication.  I'm not sure how close we are to that already, or if
> any room for improvement remains.
There isn't any implementation I know of that does this.  In theory, it 
would be pretty easy if we could somehow get block checksums from BTRFS 
in userspace.



* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-18  3:51       ` Duncan
@ 2016-01-18 12:48         ` Austin S. Hemmelgarn
  2016-01-19  8:30           ` Duncan
  0 siblings, 1 reply; 27+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-18 12:48 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2016-01-17 22:51, Duncan wrote:
> Qu Wenruo posted on Mon, 18 Jan 2016 11:16:11 +0800 as excerpted:
>
>> Duncan wrote on 2016/01/18 03:10 +0000:
>>>
>>> Doesn't the kernel write cache get synced by timeout as well as
>>> memory pressure and manual sync, with the timeouts found in
>>> /proc/sys/vm/dirty_*_centisecs, with defaults of 5 seconds
>>> background and 30 seconds higher priority foreground expiry?
>>>
>> Yep, I forgot timeout. It can also be specified by per fs mount
>> option "commit=".
>>
>> But I never /proc/sys/vm/dirty_* interface before... I'd better
>> check the code or add some debug pr_info to learn such behavior.
>
> Checking a bit more my understanding, since you brought up the
> btrfs "commit=" mount option.
>
> I knew about the option previously, and obviously knew it worked in the
> same context as the page-cache stuff, but in my understanding the btrfs
> "commit=" mount option operates at the filesystem layer, not the general
> filesystem-vm layer controlled by /proc/sys/vm/dirty_*.  In my
> understanding, therefore, the two timeouts could effectively be added,
> yielding a maximum 1 minute (30 seconds btrfs default commit time plus 30
> seconds vm expiry) commit time.
In a way, yes, except the commit option controls when a transaction is 
committed, and thus how often the log tree gets cleared.  It's 
essentially saying 'ensure the filesystem is consistent, without 
replaying a log, at least this often'.  AFAIUI, this doesn't mean you'll 
always go that long between transactions; it just puts an upper bound on 
the interval.  Looking at it another way, it pretty much says that you 
don't care about losing the last n seconds of changes to the FS.

The sysctl values are a bit different, and control how long the kernel 
will wait in the VFS layer to try and submit a larger batch of writes at 
once, so that the block layer has more it can try to merge, and 
hopefully things get written out faster as a result.  IOW, it's a knob 
to control the VFS-level write-back caching to try and tune for 
performance.  This also ties in with 
/proc/sys/vm/dirty_writeback_centisecs, which is how often the kernel 
flusher wakes up to write out a chunk of the cache that has passed the 
expiry, and /proc/sys/vm/dirty_{bytes,ratio} and 
dirty_background_{bytes,ratio}, which put an upper limit 
on how much data will be buffered before trying to flush it out to 
persistent storage.  You almost certainly want to change these, as they 
default to 10% of system RAM, which is why it often takes a ridiculous 
amount of time to unmount a flash drive that's been written to a lot. 
dirty_{ratio,bytes} set the threshold at which a process doing the 
writing is itself made to write out dirty data (the foreground limit), 
and dirty_background_{ratio,bytes} set the threshold at which the 
background flusher threads start writing it out.
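For reference, the current settings can be read straight out of /proc (a quick sketch; these are the standard Linux sysctl locations):

```shell
# Print the VFS write-back tunables discussed above, as currently set.
for f in dirty_expire_centisecs dirty_writeback_centisecs \
         dirty_ratio dirty_bytes \
         dirty_background_ratio dirty_background_bytes; do
    printf '%-26s %s\n' "$f" "$(cat /proc/sys/vm/$f)"
done
```

Note that the *_bytes and *_ratio forms are alternatives: setting one zeroes the other.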
>
> But that has always been an unverified on my part fuzzy assumption.  The
> two times could be the same layer, with the btrfs mount option being a
> per-filesystem method of controlling the same thing that /proc/sys/vm/
> dirty_expire_centisecs controls globally (as you seemed to imply above),
> or the two could be different layers but with the countdown times
> overlapping, both of which would result in a 30-second total timeout,
> instead of the 30+30=60 that I had assumed.
The two timers do overlap.
>
> And while we're at it, how does /proc/sys/vm/vfs_cache_pressure play into
> all this?  I know the dirty_* and how the dirty_*bytes vs. dirty_*ratio
> vs. dirty_*centisecs thing works, but don't quite understand how
> vfs_cache_pressure fits in with dirty_*.
vfs_cache_pressure controls how likely the kernel is to drop clean pages 
(the documentation says just dentries and inodes, but I'm relatively 
certain it's anything in the VFS cache) from the VFS cache to get memory 
to allocate.  The higher this is, the more likely the VFS cache is to 
get invalidated.  In general, you probably want to increase this on 
systems that have fast storage (like SSDs or really good SAS RAID 
arrays; 150 is usually a decent start), and decrease it if you have 
really slow storage (like a Raspberry Pi, for example).  Setting this too 
low (below about 50), however, will give you a very high chance of 
getting an OOM condition.
>
> Of course if there's already a good writeup on the dirty_* vs
> vfs_cache_pressure question somewhere, a link would be fine.  But I doubt
> there's good info on how the btrfs commit= mount option fits into it all,
> as the btrfs option is relatively newer and it's likely I'd have seen
> that all ready, if it was out there.
Documentation/sysctl/vm.txt in the kernel sources covers them, although 
the documentation is a bit sparse even there.



* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-18 12:48         ` Austin S. Hemmelgarn
@ 2016-01-19  8:30           ` Duncan
  2016-01-19  9:14             ` Duncan
  2016-01-19 12:21             ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 27+ messages in thread
From: Duncan @ 2016-01-19  8:30 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Mon, 18 Jan 2016 07:48:13 -0500 as
excerpted:

> On 2016-01-17 22:51, Duncan wrote:
>>
>> Checking a bit more my understanding, since you brought up the btrfs
>> "commit=" mount option.
>>
>> I knew about the option previously, and obviously knew it worked in the
>> same context as the page-cache stuff, but in my understanding the btrfs
>> "commit=" mount option operates at the filesystem layer, not the
>> general filesystem-vm layer controlled by /proc/sys/vm/dirty_*.  In my
>> understanding, therefore, the two timeouts could effectively be added,
>> yielding a maximum 1 minute (30 seconds btrfs default commit time plus
>> 30 seconds vm expiry) commit time.
> 
> In a way, yes, except the commit option controls when a transaction is
> committed, and thus how often the log tree gets cleared.  It's
> essentially saying 'ensure the filesystem is consistent without
> replaying a log at least this often'.  AFAIUI, this doesn't guarantee
> that you'll go that long without a transaction, but puts an upper bound
> on it.  Looking at it another way, it pretty much says that you don't
> care about losing the last n seconds of changes to the FS.

Thanks.  That's the way I was treating it.

> The sysctl values are a bit different, and control how long the kernel
> will wait in the VFS layer to try and submit a larger batch of writes at
> once, so that the block layer has more it can try to merge, and
> hopefully things get written out faster as a result.  IOW, it's a knob
> to control the VFS level write-back caching to try and tune for
> performance.  This also ties in with
> /proc/sys/vm/dirty_writeback_centisecs, which is how often after the
> expiration hits that the kernel will flush a chunk of the cache, and
> /proc/sys/vm/dirty_{background,}_{bytes,ratio} which puts an upper limit
> on how much data will be buffered before trying to flush it out to
> persistent storage.  You almost certainly want to change these, as they
> defaults to 10% of system RAM, which is why it often takes a ridiculous
> amount of time to unmount a flash drive that's been written to a lot.
> dirty_{ratio,bytes} control the per-process limit, and
> dirty_background_{ratio,bytes} control the system-wide limit.

Got that too, and yes, I've been known to recommend to others changes to 
the nowadays-ridiculous 10%-of-system-RAM default, as well. =:^)  
Random writes to spinning rust in particular may be 30 MiB/sec real-
world, and 10% of 16 GiB is 1.6 GiB, 50-some seconds worth of writeback.  
When the timeout is 30 seconds and the backlog is nearly double that, 
something's wrong.  I set mine to 3% foreground (~half a gig @ 16 GiB) 
and 1% (~160 MiB) background when I upgraded to 16 GiB RAM; tho I now 
have fast SSDs, I didn't see a need to boost it back up, as half a GiB 
is quite enough to have unsynced in case of a crash anyway.
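The arithmetic above, spelled out (the 30 MiB/sec random-write rate is the assumption from the text):

```shell
ram_mib=$((16 * 1024))           # 16 GiB of RAM, in MiB
rate=30                          # assumed spinning-rust write rate, MiB/sec

for pct in 10 3 1; do            # default, foreground, background ratios
    buf=$((ram_mib * pct / 100))
    printf '%2d%% of RAM = %4d MiB = ~%2d seconds of writeback\n' \
        "$pct" "$buf" $((buf / rate))
done
```

The 10% default comes out to 1638 MiB, roughly 54 seconds of backlog against a 30-second timeout, which is the mismatch described above.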

(Obviously once RAM goes above ~16 GiB, for systems not yet on fast SSD, 
the bytes values begin to make more sense to use than ratio, as 1% of RAM 
is simply no longer fine enough granularity.  But 1% of 16 GiB is ~163 
MiB, ~5 seconds worth @ 30 MiB/sec, so fine /enough/... barely.  The 3% 
foreground figure is then ~16 seconds worth of writeback, a bit 
uncomfortable if you're waiting on it, but comfortably below the 30 
second timeout and still at least tolerable in human terms, so not /too/ 
bad.  And as I said, for me the system and /home are now on fast SSD, so 
in practice the only time I'm worrying about spinning rust transfer 
backlogs is on the media and backups drive, which is still spinning 
rust.  And it's tolerable there, so the ratio knobs continue to be fine, 
for my own use.)
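Made persistent, the 3%/1% split described above would look something like this as a sysctl drop-in (the file name is illustrative; sysctl.conf comments must be on their own lines):

```
# /etc/sysctl.d/99-writeback.conf (hypothetical name)
# Foreground threshold: ~half a GiB @ 16 GiB RAM
vm.dirty_ratio = 3
# Background flusher threshold: ~160 MiB
vm.dirty_background_ratio = 1
```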

>> But that has always been an unverified on my part fuzzy assumption. 
>> The two times could be the same layer, with the btrfs mount option
>> being a per-filesystem method of controlling the same thing that
>> /proc/sys/vm/ dirty_expire_centisecs controls globally (as you seemed
>> to imply above), or the two could be different layers but with the
>> countdown times overlapping, both of which would result in a 30-second
>> total timeout, instead of the 30+30=60 that I had assumed.
> 
> The two timers do overlap.

Good to have it verified. =:^)  The difference between 30 seconds and a 
minute's worth of work lost in a crash can be quite a lot, if one was 
copying a big set of small files at the time.

>> And while we're at it, how does /proc/sys/vm/vfs_cache_pressure play
>> into all this?  I know the dirty_* and how the dirty_*bytes vs.
>> dirty_*ratio vs. dirty_*centisecs thing works, but don't quite
>> understand how vfs_cache_pressure fits in with dirty_*.

> vfs_cache_pressure controls how likely the kernel is to drop clean pages
> (the documentation says just dentries and inodes, but I'm relatively
> certain it's anything in the VFS cache) from the VFS cache to get memory
> to allocate.  The higher this is, the more likely the VFS cache is to
> get invalidated.  In general, you probably want to increase this on
> systems that have fast storage (like SSD's or really good SAS RAID
> arrays, 150 is usually a decent start), and decrease it if you have
> really slow storage (Like a Raspberry Pi for example).  Setting this too
> low (below about 50) however, will give you a very high chance of
> getting an OOM condition.

So vfs_cache_pressure only applies if you're out of "free" memory, and 
the kernel has to decide whether to dump cache or OOM, correct?  On 
systems with enough memory, and with stuff like the local package cache 
and/or multimedia on separate partitions that are mounted only when 
needed and unmounted when not, so that actual system-and-apps plus 
buffers plus cache memory generally stays reasonably below total RAM, 
and with reasonable ulimits and tmpfs maximum sizes set so apps can't go 
hog-wild, there's zero cache pressure, so this setting doesn't apply at 
all... unless/until there's a bad kernel leak and/or several apps go 
somewhat wild, plus something maximizes a few of those tmpfs, all at 
once, of course.

(As I write this system/app memory usage is ~2350 MiB, buffers 4 MiB, 
cache 7321 MiB, total usage ~9680 MiB, on a 16 GiB system.  That's with 
about three days uptime, after mounting the packages partition and 
remounting / rw and doing a bunch of builds, then umounting the pkgs 
partition, killing X and running a lib_users check to ensure no services 
are running on outdated deleted libs and need restarted, remounting / ro, 
and restarting X.  At some point I had the media partition mounted too, 
but now it's unmounted again, dropping that cache.  So in addition to 
cache memory which /could/ be dumped if I had to, I have 6+ GiB of 
entirely idle unused memory.  Nice as I don't have swap configured, so if 
I'm out of RAM, I'm out, but there's a lot of cache to dump first before 
it gets that bad.  Meanwhile, zero cache pressure, and 6+ GiB of spare 
RAM to use for apps/tmpfs/cache if I need it, before any cache dumps at 
all! =:^)

> Documentation/sysctl/vm.txt in the kernel sources covers them, although
> the documentation is a bit sparse even there.

Between the kernel's proc documentation in Documentation/filesystems/
proc.txt, plus whatever outside resource it was that originally got me 
looking into the whole thing in the first place, I had the /proc/sys/vm/
dirty_* files and their usage covered.  But the sysctl/* doc files and 
the vfs_cache_pressure proc file, not so much, and as I said I didn't 
understand how the btrfs commit= mount option fit into all of this.  So 
now I have a rather better understanding of how it all fits together. =:^)

Thanks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-19  8:30           ` Duncan
@ 2016-01-19  9:14             ` Duncan
  2016-01-19 12:28               ` Austin S. Hemmelgarn
  2016-01-19 12:21             ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 27+ messages in thread
From: Duncan @ 2016-01-19  9:14 UTC (permalink / raw)
  To: linux-btrfs

Duncan posted on Tue, 19 Jan 2016 08:30:43 +0000 as excerpted:

> (As I write this system/app memory usage is ~2350 MiB, buffers 4 MiB,
> cache 7321 MiB, total usage ~9680 MiB, on a 16 GiB system.  That's with
> about three days uptime, after mounting the packages partition and
> remounting / rw and doing a bunch of builds, then umounting the pkgs
> partition, killing X and running a lib_users check to ensure no services
> are running on outdated deleted libs and need restarted, remounting
> / ro, and restarting X.  At some point I had the media partition
> mounted too, but now it's unmounted again, dropping that cache.  So in
> addition to cache memory which /could/ be dumped if I had to, I have
> 6+ GiB of entirely idle unused memory.  Nice as I don't have swap
> configured, so if I'm out of RAM, I'm out, but there's a lot of cache
> to dump first before it gets that bad.  Meanwhile, zero cache pressure,
> and 6+ GiB of spare RAM to use for apps/tmpfs/cache if I need it,
> before any cache dumps at all! =:^)

Oh, I also don't allow any crazy indexers, like kde's baloo or the older 
updatedb for (s)locate, to go crazy indexing everything, thereby wasting 
valuable cache memory on files I won't actually be using.  These things 
get shut down as soon as I discover new ones, and preferably get 
uninstalled, with dependencies on them turned off (on gentoo, via 
appropriate USE flag) as well.  On kde4 I was even carrying my own no-
semantic-desktop patches for awhile, when gentoo/kde decided they weren't 
going to support kde without semantic-desktop.  Fortunately they changed 
their minds.  I'm now finally updated to kde-frameworks5 with plasma5, 
and have baloo installed for that as I don't yet grok how to keep it off 
the system entirely in frameworks/plasma5, but it's definitely shut down 
as far as runtime goes.

There is a package indexer that runs, and of course syncing package 
updates loads all that in cache, but all that's on my packages partition, 
unmounted when I'm not actively doing package updates, etc, thereby 
freeing the package updates subsystem caches.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-19  8:30           ` Duncan
  2016-01-19  9:14             ` Duncan
@ 2016-01-19 12:21             ` Austin S. Hemmelgarn
  2016-01-20 15:12               ` Al
  1 sibling, 1 reply; 27+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-19 12:21 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2016-01-19 03:30, Duncan wrote:
> Austin S. Hemmelgarn posted on Mon, 18 Jan 2016 07:48:13 -0500 as
> excerpted:
>
>> On 2016-01-17 22:51, Duncan wrote:
>>>
>>> Checking a bit more my understanding, since you brought up the btrfs
>>> "commit=" mount option.
>>>
>>> I knew about the option previously, and obviously knew it worked in the
>>> same context as the page-cache stuff, but in my understanding the btrfs
>>> "commit=" mount option operates at the filesystem layer, not the
>>> general filesystem-vm layer controlled by /proc/sys/vm/dirty_*.  In my
>>> understanding, therefore, the two timeouts could effectively be added,
>>> yielding a maximum 1 minute (30 seconds btrfs default commit time plus
>>> 30 seconds vm expiry) commit time.
>>
>> In a way, yes, except the commit option controls when a transaction is
>> committed, and thus how often the log tree gets cleared.  It's
>> essentially saying 'ensure the filesystem is consistent without
>> replaying a log at least this often'.  AFAIUI, this doesn't guarantee
>> that you'll go that long without a transaction, but puts an upper bound
>> on it.  Looking at it another way, it pretty much says that you don't
>> care about losing the last n seconds of changes to the FS.
>
> Thanks.  That's the way I was treating it.
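
(Concretely -- the device and mount point here are hypothetical -- capping
that loss window at roughly 15 seconds would just be commit=15 in the
mount options:)

```
# /etc/fstab entry (device and mount point made up for illustration)
/dev/sdX1  /mnt/data  btrfs  defaults,commit=15  0  0
```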
>
>> The sysctl values are a bit different, and control how long the kernel
>> will wait in the VFS layer to try and submit a larger batch of writes at
>> once, so that the block layer has more it can try to merge, and
>> hopefully things get written out faster as a result.  IOW, it's a knob
>> to control the VFS level write-back caching to try and tune for
>> performance.  This also ties in with
>> /proc/sys/vm/dirty_writeback_centisecs, which is how often after the
>> expiration hits that the kernel will flush a chunk of the cache, and
>> /proc/sys/vm/dirty_{,background_}{bytes,ratio} which puts an upper limit
>> on how much data will be buffered before trying to flush it out to
>> persistent storage.  You almost certainly want to change these, as they
>> default to 10% of system RAM, which is why it often takes a ridiculous
>> amount of time to unmount a flash drive that's been written to a lot.
>> dirty_{ratio,bytes} control the per-process limit, and
>> dirty_background_{ratio,bytes} control the system-wide limit.
>
> Got that too, and yes, I've been known to recommend to others changes to
> the now-days ridiculous 10% of system RAM buffer thing, as well. =:^)
> Random writes to spinning rust in particular may be 30 MiB/sec real-
> world, and 10% of 16 GiB is 1.6 GiB, 50-some seconds worth of writeback.
> When the timeout is 30 seconds and the backlog is nearly double that,
> something's wrong.  I set mine to 3% foreground (~ half a gig @ 16 GiB)
> and 1% (~160 MiB) background when I upgraded to 16 GiB RAM, tho now I
> have fast SSDs, but didn't see a need to boost it back up, as half a GiB
> is quite enough to have unsynced in case of a crash anyway.
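
That back-of-the-envelope math, sketched as a few lines of Python
(assuming the simplistic model that a completely full dirty buffer
drains at a constant write rate):

```python
def writeback_backlog_seconds(ram_gib, dirty_ratio_pct, write_mib_per_s):
    """Rough seconds needed to flush a completely full dirty-page buffer."""
    dirty_mib = ram_gib * 1024 * dirty_ratio_pct / 100
    return dirty_mib / write_mib_per_s

# Default 10% of 16 GiB at ~30 MiB/s of random writes to spinning rust:
print(round(writeback_backlog_seconds(16, 10, 30)))  # ~55 s, nearly double the 30 s timeout
# The 3% foreground figure from above:
print(round(writeback_backlog_seconds(16, 3, 30)))   # ~16 s
```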
Personally I usually just use small byte values (64MB for the 
system-wide limit, and 4MB for the per-process limit).  I also do a decent 
amount of work with removable media (which takes longer to unmount the 
higher these are), and have good SSD's that do proper write-reordering 
and guarantee that writes will finish even if power dies in the middle, 
and don't care as much about write performance on my traditional disks 
(most of those are used as backing storage for VM's which can fit their 
entire working set in RAM, so having fast storage isn't as high priority 
for them).
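
(As a sysctl fragment -- the file name and exact values here are
illustrative only, and note the kernel expects the hard limit to sit
above the background threshold:)

```
# /etc/sysctl.d/99-dirty.conf (illustrative values)
vm.dirty_background_bytes = 4194304    # start background flushing at 4 MB
vm.dirty_bytes = 67108864              # hard limit: block writers at 64 MB
```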
>
> (Obviously once RAM goes above ~16 GiB, for systems not yet on fast SSD,
> the bytes values begin to make more sense to use than ratio, as 1% of RAM
> is simply no longer fine enough granularity.  But 1% of 16 GiB is ~163
> MiB, ~5 seconds worth @ 30 MiB/sec, so fine /enough/... barely.  The 3%
> foreground figure is then ~16 seconds worth of writeback, a bit
> uncomfortable if you're waiting on it, but comfortably below the 30
> second timeout and still at least tolerable in human terms, so not /too/
> bad.  And as I said, for me the system and /home are now on fast SSD, so
> in practice the only time I'm worrying about spinning rust transfer
> backlogs is on the media and backups drive, which is still spinning
> rust.  And it's tolerable there, so the ratio knobs continue to be fine,
> for my own use.)
>
>>> But that has always been an unverified on my part fuzzy assumption.
>>> The two times could be the same layer, with the btrfs mount option
>>> being a per-filesystem method of controlling the same thing that
>>> /proc/sys/vm/ dirty_expire_centisecs controls globally (as you seemed
>>> to imply above), or the two could be different layers but with the
>>> countdown times overlapping, both of which would result in a 30-second
>>> total timeout, instead of the 30+30=60 that I had assumed.
>>
>> The two timers do overlap.
>
> Good to have it verified. =:^)  The difference between 30 seconds and a
> minute's worth of work lost in a crash can be quite a lot, if one was
> copying a big set of small files at the time.
>
>>> And while we're at it, how does /proc/sys/vm/vfs_cache_pressure play
>>> into all this?  I know the dirty_* and how the dirty_*bytes vs.
>>> dirty_*ratio vs. dirty_*centisecs thing works, but don't quite
>>> understand how vfs_cache_pressure fits in with dirty_*.
>
>> vfs_cache_pressure controls how likely the kernel is to drop clean pages
>> (the documentation says just dentries and inodes, but I'm relatively
>> certain it's anything in the VFS cache) from the VFS cache to get memory
>> to allocate.  The higher this is, the more likely the VFS cache is to
>> get invalidated.  In general, you probably want to increase this on
>> systems that have fast storage (like SSD's or really good SAS RAID
>> arrays, 150 is usually a decent start), and decrease it if you have
>> really slow storage (Like a Raspberry Pi for example).  Setting this too
>> low (below about 50) however, will give you a very high chance of
>> getting an OOM condition.
>
> So vfs_cache_pressure only applies if you're out of "free" memory, and
> the kernel has to decide whether to dump cache or OOM, correct?  On
> systems with enough memory, and with stuff like the local package cache
> and/or multimedia on separate partitions that are mounted only when
> needed and unmounted when not, so actual system-and-apps plus buffers
> plus cache memory generally stays reasonably below total RAM, with
> reasonable ulimits and tmpfs maximum sizes set so apps can't go hog-wild,
> there's zero cache pressure so this setting doesn't apply at all...
> unless/until there's a bad kernel leak and/or several apps go somewhat
> wild, plus something's maximizing a few of those tmpfs, all at once, of
> course.
Kind of, it comes into play any time the kernel goes to reclaim memory, 
which is usually to complete higher order allocations in kernel space 
(like allocating big DMA buffers or similar stuff).  It's important to 
note that it's not usually a factor in dealing with an OOM condition 
(unless you lower it, in which case it can be a big contributing 
factor).  As an example, say you plug in a USB NIC, the kernel has to 
allocate a lot of different things to be able to work with it reliably, 
and /proc/sys/vm/vfs_cache_pressure tells it how much to favor dropping 
bits of the VFS cache to satisfy those allocations as opposed to other 
methods (like memory compaction, which can be expensive on big systems).
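
(For reference, the knob lives at /proc/sys/vm/vfs_cache_pressure; to
inspect or change it, something like:)

```
cat /proc/sys/vm/vfs_cache_pressure   # kernel default is 100
sysctl vm.vfs_cache_pressure=150      # favor dropping VFS cache, per the advice above
```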
>
> (As I write this system/app memory usage is ~2350 MiB, buffers 4 MiB,
> cache 7321 MiB, total usage ~9680 MiB, on a 16 GiB system.  That's with
> about three days uptime, after mounting the packages partition and
> remounting / rw and doing a bunch of builds, then umounting the pkgs
> partition, killing X and running a lib_users check to ensure no services
> are running on outdated deleted libs and need restarted, remounting / ro,
> and restarting X.  At some point I had the media partition mounted too,
> but now it's unmounted again, dropping that cache.  So in addition to
> cache memory which /could/ be dumped if I had to, I have 6+ GiB of
> entirely idle unused memory.  Nice as I don't have swap configured, so if
> I'm out of RAM, I'm out, but there's a lot of cache to dump first before
> it gets that bad.  Meanwhile, zero cache pressure, and 6+ GiB of spare
> RAM to use for apps/tmpfs/cache if I need it, before any cache dumps at
> all! =:^)
I wish I could get away with running without swap :)  My laptop only has 
8G of RAM, and I run Xen on my desktop, which means I have significantly 
less than the 32G of installed RAM to work with from my desktop VM 
there, and if I don't use swap, I often end up killing the machine 
trying to do some of the multimedia work I sometimes do.  OTOH, I've got 
swap on an SSD on both systems, which gets me ridiculous performance 
since I've got them configured to swap in and out pages in groups the 
size of an erase block on the SSD (which also means that it's not 
tearing up the SSD as much either).
>
>> Documentation/sysctl/vm.txt in the kernel sources covers them, although
>> the documentation is a bit sparse even there.
>
> Between the kernel's proc documentation in Documentation/filesystems/
> proc.txt, plus whatever outside resource it was that originally got me
> looking into the whole thing in the first place, I had the /proc/sys/vm/
> dirty_* files and their usage covered.  But the sysctl/* doc files and
> the vfs_cache_pressure proc file, not so much, and as I said I didn't
> understand how the btrfs commit= mount option fit into all of this.  So
> now I have a rather better understanding of how it all fits together. =:^)
Glad I could help.  The sysctl options are one of the places I would 
love to see better documented, I just don't have the time and enough 
knowledge of them to do so myself.  There's still a significant number 
that aren't documented there at all (lots of them in /proc/sys/kernel).


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-19  9:14             ` Duncan
@ 2016-01-19 12:28               ` Austin S. Hemmelgarn
  2016-01-19 15:40                 ` Duncan
  2016-01-20  8:32                 ` Brendan Hide
  0 siblings, 2 replies; 27+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-19 12:28 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2016-01-19 04:14, Duncan wrote:
> Oh, I also don't allow any crazy indexers, like kde's baloo or the older
> updatedb for (s)locate, to go crazy indexing everything, thereby wasting
> valuable cache memory on files I won't actually be using.  These things
> get shut down as soon as I discover new ones, and preferably get
> uninstalled, with dependencies on them turned off (on gentoo, via
> appropriate USE flag) as well.  On kde4 I was even carrying my own no-
> semantic-desktop patches for awhile, when gentoo/kde decided they weren't
> going to support kde without semantic-desktop.  Fortunately they changed
> their minds.  I'm now finally updated to kde-frameworks5 with plasma5,
> and have baloo installed for that as I don't yet grok how to keep it off
> the system entirely in frameworks/plasma5, but it's definitely shut down
> as far as runtime goes.
There's a reason I don't use KDE...
(Well, a couple actually, the indexing getting pulled in is only part of 
it, I also dislike the all-or-nothing packaging (everything seems to 
depend on everything else), and having to update the whole thing in 
lock-step; somewhat ironically, GNOME has the same issues these days, so 
I don't use that either).

That aside, it's probably worth noting that updatedb used by 
{,m,s}locate only indexes metadata, and it does so a lot more 
efficiently than most of the desktop search-engine indexers out there, 
so it's not quite as bad as baloo or tracker or some of the other 
options.  Unlike those, updatedb pretty much just calls stat on 
everything it's told to index, which takes time, but is not particularly 
bad for your cache if you're just running it on your home directory.
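
Something like this toy pass, conceptually (purely illustrative, not the
actual mlocate code):

```python
import os

def index_metadata(root):
    """Toy updatedb-style pass: stat() every file, never open contents."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            entries.append((path, st.st_size, int(st.st_mtime)))
    return entries
```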
>
> There is a package indexer that runs, and of course syncing package
> updates loads all that in cache, but all that's on my packages partition,
> unmounted when I'm not actively doing package updates, etc, thereby
> freeing the package updates subsystem caches.
For those who might be interested, autofs is wonderful for handling 
stuff like this.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-19 12:28               ` Austin S. Hemmelgarn
@ 2016-01-19 15:40                 ` Duncan
  2016-01-20  8:32                 ` Brendan Hide
  1 sibling, 0 replies; 27+ messages in thread
From: Duncan @ 2016-01-19 15:40 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Tue, 19 Jan 2016 07:28:44 -0500 as
excerpted:

> There's a reason I don't use KDE...
> (Well, a couple actually, the indexing getting pulled in is only part of
> it, I also dislike the all-or-nothing packaging (everything seems to
> depend on everything else), and having to update the whole thing in
> lock-step; somewhat ironically, GNOME has the same issues these days, so
> I don't use that either).

Rather OT for the list, but... while I'm still using a kde desktop, it's 
pretty stripped down.  As I said, until I installed plasma5, I not only 
had nepomuk/baloo turned off, but I had them stripped out at build-time 
as well.  And I expect to get there again.  I'm already package.providing 
udisks as a plasma5 dep as it's runtime-only, and expect to strip out 
polkit similarly as I've seen indications it's runtime-only, as well as 
baloo again eventually, and networkmanager.

Late in the kde3 cycle I was actually contemplating getting rid of the 
last few gtk2 apps I had and switching to qt/kde only.  But over the 
course of kde4, mostly at the beginning when kde4 was still so broken but 
kde3 wasn't supported any more, but later (~4.6) for konqueror when it 
became apparent its devs considered it little more than a toy, and kmail 
when it jumped the akonadi shark, I switched off of kde for nearly 
everything except the desktop itself, superkaramba (which is being 
dropped tho plasma supposedly supports it, tho I could never get that to 
work properly with my theme back before I decided to give up and just use 
superkaramba, so I'm not sure whether I can get plasma to work there or 
not and I might have to switch to gkrellm or some such), a few games 
which I can give up or there's alternatives for, and dolphin and gwenview 
as file and image managers, with gimv (GImageViewer) already installed as 
an alternative for the latter and pretty much any graphic file manager 
workable as a dolphin replacement since I do much of my file management 
in the terminal using either CLI or the ncurses-based mc anyway.  So the 
situation has nearly reversed from that of the late kde3 cycle, and now 
I'd find it much easier to dump kde than gtk2, as it's primarily a 
relatively lite kde desktop that's my not immediately replaceable kde 
tools now.  And I'm sure I could find a workable alternative to it too, 
if I had too.  Enlightenment has always been on my list to try, and a 
(likely heavily customized by the time I'm done with it) qt-based lxde is 
on my short list as well.

So we'll see how plasma5 develops.  Meanwhile, there's the x11/wayland 
switch coming up, which could yet rock the Linux desktop environment 
landscape pretty wildly, changing it as we know it and putting entirely 
different environments at the forefront a few years from now.  But other 
than gnome, which isn't an option for me due to their "our way is the 
only correct way" attitude, and kde, I simply don't know enough about 
what other environments are doing with it to have the foggiest, at this 
point, particularly if I don't choose to stay with kde/plasma as my 
desktop thru that transition, which is a possibility.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-19 12:28               ` Austin S. Hemmelgarn
  2016-01-19 15:40                 ` Duncan
@ 2016-01-20  8:32                 ` Brendan Hide
  1 sibling, 0 replies; 27+ messages in thread
From: Brendan Hide @ 2016-01-20  8:32 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Duncan, linux-btrfs

On 1/19/2016 2:28 PM, Austin S. Hemmelgarn wrote:
> That aside, it's probably worth noting that updatedb used by 
> {,m,s}locate only indexes metadata, and it does so a lot more 
> efficiently than most of the desktop search-engine indexers out there, 
> so it's not quite as bad as baloo or tracker or some of the other 
> options.  Unlike those, updatedb pretty much just calls stat on 
> everything it's told to index, which takes time, but is not 
> particularly bad for your cache if you're just running it on your home 
> directory.
Related Project Idea:
Snapshot-aware updatedb/locate -> 
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Snapshot-aware_updatedb.2Flocate

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-16 14:10 ` Duncan
  2016-01-16 18:07   ` Rich Freeman
@ 2016-01-20 14:43   ` Al
  2016-01-21  8:23     ` Qu Wenruo
  1 sibling, 1 reply; 27+ messages in thread
From: Al @ 2016-01-20 14:43 UTC (permalink / raw)
  To: linux-btrfs

Duncan <1i5t5.duncan <at> cox.net> writes:

> 
> Al posted on Sat, 16 Jan 2016 12:27:16 +0000 as excerpted:
> 

That it does, Duncan, thank you!

I was suggesting, albeit implicitly, that unless you're really short of
block dev space (!), which is a pretty naff dedup strategy, dedup isn't time
critical AFAIC(See). My server memory is not huge and I'd happily let it
chug away dedup'ing rather than have the whole thing run like a dog for
lack of memory.

I'm looking forward to using it; keep up the very good work.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-16 18:07   ` Rich Freeman
  2016-01-18 12:23     ` Austin S. Hemmelgarn
@ 2016-01-20 14:49     ` Al
  1 sibling, 0 replies; 27+ messages in thread
From: Al @ 2016-01-20 14:49 UTC (permalink / raw)
  To: linux-btrfs

Rich Freeman <r-btrfs <at> thefreemanclan.net> writes:

> I think he is actually suggesting a hybrid approach where a bit of
> effort is done during operations to greatly streamline out-of-line
> deduplication.  I'm not sure how close we are to that already, or if
> any room for improvement remains.
> 
> --

Hopefully, we can tweak the split between in-memory and on blkdev. I don't
see myself needing in-memory much (cc'd email with attachments perhaps?).




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-18  1:36 ` Qu Wenruo
  2016-01-18  3:10   ` Duncan
@ 2016-01-20 14:53   ` Al
  1 sibling, 0 replies; 27+ messages in thread
From: Al @ 2016-01-20 14:53 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo <quwenruo <at> cn.fujitsu.com> writes:

> 
> 
> Al wrote on 2016/01/16 12:27 +0000:
> > Hi,

Thank you for taking the time to reply at length. Most helpful.





^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-19 12:21             ` Austin S. Hemmelgarn
@ 2016-01-20 15:12               ` Al
  2016-01-20 18:21                 ` Duncan
  0 siblings, 1 reply; 27+ messages in thread
From: Al @ 2016-01-20 15:12 UTC (permalink / raw)
  To: linux-btrfs

Sometimes it's the odd OT posts that are the most interesting!


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-20 15:12               ` Al
@ 2016-01-20 18:21                 ` Duncan
  0 siblings, 0 replies; 27+ messages in thread
From: Duncan @ 2016-01-20 18:21 UTC (permalink / raw)
  To: linux-btrfs

Al posted on Wed, 20 Jan 2016 15:12:09 +0000 as excerpted:

> Sometimes it's the odd OT posts that are the most interesting!

Now you got me going philosophical.  =:^)  The following may be 
interesting, but it's only on topic in a rather "meta" sense.  Many will 
wish to skip.

That's actually what has kept me "addicted" to newsgroups and mailing 
lists (which I do via gmane.org's list2news service as newsgroups) over 
the decades.  Just when a topic is getting boring and you're thinking 
about unsubscribing, along comes something entirely OT and unpredictable 
that's really helpful in some other area, either right now, or to be 
tucked away for later, either for myself or to pass on to someone else 
who it can help in some list/news discussion.

Plus, this being a kernel and filesystem related list, the depth of 
technical knowledge available here is simply amazing.

Nearly two decades ago now, I had an experience every technically 
inclined geek needs to have at some point, as it would solve a lot of 
problems.  To that point, I had often been among the most technically 
knowledgeable in many of my online discussions, even tho on the MSIE 
groups especially, some of them were responsible for an impressive number 
of boxes.  Then I ended up on a DSL ISP called Speakeasy, which at the 
time was pretty small, but growing, having originated out of a Seattle 
tea and coffee house (!!) of the same name.  A few years later it had 
gone to **** and I switched, but for a few years, it had some /really/ 
knowledgeable Unix/Linux/BSD folks, including one with direct commit 
access to one of the BSDs.  They were an /immense/ help when I was 
switching to Linux, but there's a rather different point here.

Like I said, until that point I was used to being one of the higher 
technically literate folks around, to the point even people in charge of 
fleets of thousands of (MS-based) computers were taking my advice on the 
then new IE4 and 5, MS Active Desktop, W98, etc.  But on that ISP, I very 
quickly learned how little I actually knew, becoming technically speaking 
the newbie.  For the first time in my life I couldn't simply make 
statements about how I thought the technology at hand worked and have 
people take them as truth because it was enough out of their realm they 
had no way to question it.  I was challenged on my statements, and had to 
back down a couple times before I very quickly learned to qualify things 
I wasn't sure of, as on that ISP's newsgroup, the tables really were 
turned and I was the technical know-nothing, at least compared to the 
knowledge and experience of these guys.

That's an incredibly valuable life lesson and experience to learn/have.  
Since then, as I've watched the various larger than life technical 
personalities and seen the various arguments, too many of which end up 
with someone with incredible technical skills leaving, I think back, and 
wish they could have had a similar experience somewhat earlier.  I'm 
absolutely sure if they had, that we'd not have the acrimonious forks, 
etc, that so often happen in the FLOSS world, the problem being that like 
me back then, so many technical leaders are simply used to being able to 
make technical statements unquestioned, and simply don't have the skills 
to deal with people at the same level actually being able to question 
them and their statements at their own level or above, because other than 
these rare cases which all too often end up going nuclear, there's simply 
no one at their level, /able/ to question them and to hold them to proper 
accountability.  Were they to have had at some earlier point an 
experience like I had at Speakeasy to (nicely but firmly) put them in 
their place as I was put in mine...  Of course, the Asperger Syndrome the 
highly technically inclined often have, to one degree or another, doesn't 
help...

Anyway, on this list I'm in very much the same position, dealing with 
people significantly above my own level, an otherwise somewhat rare 
experience in my life, almost non-existent "in real life", still rare, 
but much less so, online.  But I learned from that earlier experience, 
which is why you'll so often see "not a dev only a user and list regular" 
disclaimers on my posts, as well as many /many/ more "AFAIK", "I 
believe", "based on what I've seen on-list", "if a dev says different, 
listen to them, not me", etc.

And actually, learning to add those qualifiers has saved my *** a few 
times in other contexts, both online and off, as well, allowing me a 
graceful out when otherwise I'd have been forced to either defend a wrong 
position I backed myself into, or take a humiliating defeat.  Just one 
little AFAIK or "I believe" makes it /so/ much easier to back down, if it 
comes to that.  =:^)

Meanwhile, as here I /am/ among those well above my own level, I'm not 
afraid to ask the opportunistic question when I don't know, either, as 
this subthread demonstrates.

So let the development... and learning, continue! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-20 14:43   ` Al
@ 2016-01-21  8:23     ` Qu Wenruo
  2016-01-21 14:53       ` Al
  0 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2016-01-21  8:23 UTC (permalink / raw)
  To: Al, linux-btrfs



Al wrote on 2016/01/20 14:43 +0000:
> Duncan <1i5t5.duncan <at> cox.net> writes:
>
>>
>> Al posted on Sat, 16 Jan 2016 12:27:16 +0000 as excerpted:
>>
>
> That it does, Duncan, thank you!
>
> I was suggesting, albeit implicitly, that unless you're really short of
> block dev space (!), which is a pretty naff dedup strategy, dedup isn't time
> critical AFAIC(See). My server memory is not huge and I'd happily let it
> chug away dedup'ing than have the whole thing run like a dog for lack of memory.
>
> I'm looking forward to using it; keep up the very good work.
>

The two backends are designed for different use cases.

If you don't think the in-memory backend is good, then just use the on-disk one.
Even if in-memory doesn't seem useful to you, there are still some cases
that you may not know about.

The design of the in-memory backend is not to save your block dev space, but
to limit the write overhead to a consistent value.

If one day you just want to write 64K of data, but you need to randomly
read 128K of metadata only to find a hash miss before doing the real write,
then you may understand the point of the in-memory backend.



To conclude, something being meaningless to you doesn't mean it's meaningless
for everyone.
If you really think the design is naff, I'd be very glad if you could
provide a better one.

Thanks,
Qu

>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-21  8:23     ` Qu Wenruo
@ 2016-01-21 14:53       ` Al
  2016-01-21 17:23         ` Chris Murphy
  0 siblings, 1 reply; 27+ messages in thread
From: Al @ 2016-01-21 14:53 UTC (permalink / raw)
  To: linux-btrfs

>> Al posted on Sat, 16 Jan 2016 12:27:16 +0000 as excerpted:
> >
> > .. unless you're really short of
> > block dev space (!), which is a pretty naff dedup strategy,

> > I'm looking forward to using it; keep up the very good work.

> If you really think the design is naff, I'd be very glad if you could
> provide a better one.

Wenruo, I would suggest that you concentrate on your English comprehension
before you reply in such a manner. You also appear to have replied to a
thank-you message directed at another person.

Address your emotional problems in a more appropriate place.

As I said before,

> > I'm looking forward to using [dedup]; keep up the very good work.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-21 14:53       ` Al
@ 2016-01-21 17:23         ` Chris Murphy
  2016-01-22 11:33           ` Al
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Murphy @ 2016-01-21 17:23 UTC (permalink / raw)
  To: Al; +Cc: Btrfs BTRFS

On Thu, Jan 21, 2016 at 7:53 AM, Al <6401e46d@opayq.com> wrote:
>>> Al posted on Sat, 16 Jan 2016 12:27:16 +0000 as excerpted:
>> >
>> > .. unless you're really short of
>> > block dev space (!), which is a pretty naff dedup strategy,
>
>> > I'm looking forward to using it; keep up the very good work.
>
>> If you really think the design is naff, I'd be very glad if you could
>> provide a better one.
>
> Wenruo, I would suggest that you concentrate on your English comprehension
> before you reply in such a manner.

Going back and rereading the "naff" comment in context, I find it
confusing. So I guess my English comprehension requires concentration
also. I'd like to think you're saying that being short of block device
space while relying on on-disk hash table dedup (rather than
in-memory) is not a good idea for the user to do to himself? I can't
tell, but if that table is in its own tree, the user isn't likely to
run into that problem. Anyway, naff has a negative connotation to it
so it sounds like it's a backhanded criticism. Maybe there's something
being lost in translation between British and American English.

>
> Address your emotional problems in a more appropriate place.

You just stepped into the same pile of poo you're accusing Qu of stepping in.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-21 17:23         ` Chris Murphy
@ 2016-01-22 11:33           ` Al
  2016-01-23  2:44             ` Chris Murphy
  2016-02-02  2:55             ` Qu Wenruo
  0 siblings, 2 replies; 27+ messages in thread
From: Al @ 2016-01-22 11:33 UTC (permalink / raw)
  To: linux-btrfs

With respect, Chris, if you intentionally ignore all of the other context
clues ("keep up the great work", "looking forward to using it"), you could
come to that conclusion. You'd have to try really, really hard.

I wasn't criticising your buddy in any way, quite the opposite. Can we move
on, now? There are better ways of spending our time.







^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-22 11:33           ` Al
@ 2016-01-23  2:44             ` Chris Murphy
  2016-02-02  2:55             ` Qu Wenruo
  1 sibling, 0 replies; 27+ messages in thread
From: Chris Murphy @ 2016-01-23  2:44 UTC (permalink / raw)
  To: Al; +Cc: Btrfs BTRFS

On Fri, Jan 22, 2016 at 4:33 AM, Al <6401e46d@opayq.com> wrote:
> With respect Chris, if you intentionally ignore all of the other context
> clues ("keep up the great work", "looking forward to using it"), you could
> come to that conclusion. You'd have to try really really hard.
>
> I wasn't criticising your buddy in any way, quite the opposite. Can we move
> on, now? There are better ways of spending our time.

We can move on anytime you like. But you're the one bringing up
perceived emotional problems of others, so when you do that, your own
emotional problems are completely in-scope as well. Don't like that?
Then don't bring it up, just let it go next time.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-18 12:23     ` Austin S. Hemmelgarn
@ 2016-01-23 22:22       ` Mark Fasheh
  0 siblings, 0 replies; 27+ messages in thread
From: Mark Fasheh @ 2016-01-23 22:22 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Rich Freeman, Duncan, Btrfs BTRFS

On Mon, Jan 18, 2016 at 07:23:17AM -0500, Austin S. Hemmelgarn wrote:
> On 2016-01-16 13:07, Rich Freeman wrote:
> >On Sat, Jan 16, 2016 at 9:10 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> >>Al posted on Sat, 16 Jan 2016 12:27:16 +0000 as excerpted:
> >>
> >>>Is there any urgency for dedup? What's wrong with storing the hash on
> >>>disk with the block and having a separate process dedup the written data
> >>>over time;
> >>
> >>There's actually uses for both inline and out-of-line[1] aka delayed
> >>dedup.  Btrfs already has a number of independent products doing various
> >>forms of out-of-line dedup, so what's missing and being developed now is
> >>the inline dedup option, which being directly in the write processing,
> >>must be handled by btrfs itself -- it can't be primarily done by third
> >>parties with just a few kernel calls, like out-of-line dedup can.
> >
> >Does the out-of-line dedup option actually utilize stored hashes, or
> >is it forced to re-read all the data to compute hashes?  If it is
> >collecting checksums/etc is this done efficiently?
> AFAIK, duperemove has the option to store block hashes in a database
> to save them between runs (I'm pretty sure that it invalidates
> hashes if the file containing the block changed, but I'm not
> certain).

Yes, duperemove can use a hashfile (this is the recommended way of running
it). Right now it's more or less temporary storage for one run (though as
you noted you can reuse the hashfile later).

The feature to rescan only the changed parts of a filesystem (and reuse the
hashfile) is in development and will be available with the next major
release.
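The hashfile idea sketched above (persist per-file hashes between runs and re-hash only files whose mtime changed) can be illustrated like this. A toy Python sketch under my own assumptions; duperemove's real hashfile is an on-disk database and its change detection may differ:

```python
import hashlib
import os

def refresh_hashfile(hashfile, paths):
    """Toy duperemove-style hashfile: a dict mapping path -> {mtime, hash}.
    Files whose mtime is unchanged keep their stored hash; changed or new
    files are re-read and re-hashed.  Illustrative only, not duperemove code."""
    for path in paths:
        mtime = os.stat(path).st_mtime
        entry = hashfile.get(path)
        if entry and entry["mtime"] == mtime:
            continue                        # unchanged since last run: reuse hash
        with open(path, "rb") as f:         # changed (or new): re-hash the file
            digest = hashlib.sha256(f.read()).hexdigest()
        hashfile[path] = {"mtime": mtime, "hash": digest}
    return hashfile
```

On a second run over an unchanged tree, only the `stat()` calls happen; no file data is read, which is the whole win of persisting the hashes.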


> >I think he is actually suggesting a hybrid approach where a bit of
> >effort is done during operations to greatly streamline out-of-line
> >deduplication.  I'm not sure how close we are to that already, or if
> >any room for improvement remains.
> There isn't any implementation I know of that does this.  In theory,
> it would be pretty easy if we could somehow get block checksums
> from BTRFS in userspace.

Btrfs block checksums are there to detect whether a block was corrupted.
It is generally agreed that they would collide a lot, which would cause
unnecessary file reads during the dedupe ioctl.

Also, if we get them from the FS we can only get them in block-sized chunks,
which at 4K makes dedupe expensive. Duperemove, for example, defaults to 128K
chunks for this reason.
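To put rough numbers on those two points (this is my own back-of-envelope arithmetic, not figures from the thread):

```python
# A 1 GiB file at 4K vs 128K dedupe granularity, plus the birthday bound
# for a 32-bit checksum such as crc32.  Illustrative arithmetic only.

FILE_SIZE = 1 << 30                        # 1 GiB

chunks_4k = FILE_SIZE // (4 << 10)         # dedupe candidates at 4K granularity
chunks_128k = FILE_SIZE // (128 << 10)     # at duperemove's 128K default
ratio = chunks_4k // chunks_128k           # 32x fewer extents to track and ioctl

# Expected collisions among n random 32-bit checksums is ~ n*(n-1)/(2*2^32).
# With 4K chunks of a single 1 GiB file that is already about 8 spurious
# matches, each forcing a byte-by-byte comparison read.
n = chunks_4k
expected_collisions = n * (n - 1) / (2 * 2**32)
```

Even in this small example the 32-bit checksum produces false matches; a dedup hash needs a far larger digest precisely so that a hash hit can be trusted without re-reading the data.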
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
  2016-01-22 11:33           ` Al
  2016-01-23  2:44             ` Chris Murphy
@ 2016-02-02  2:55             ` Qu Wenruo
  1 sibling, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-02-02  2:55 UTC (permalink / raw)
  To: Al, linux-btrfs

You're right, I misunderstood the mail and got a little emotional.

And sorry for that inappropriate reply.


BTW, the new patchset will be sent to the btrfs mailing list in the coming
hours. If you're interested in in-band dedup, you can use the following
GitHub repos to build and test it:

Kernel
https://github.com/adam900710/linux.git wang_dedup

Btrfs-progs
https://github.com/adam900710/btrfs-progs.git dedup

And if you have any concerns about the implementation or the dedup-related
documentation, I'm happy to hear them.
(I was a little emotional at that time just because of some unexpected bugs.)

Thanks,
Qu


Al wrote on 2016/01/22 11:33 +0000:
> With respect Chris, if you intentionally ignore all of the other context
> clues ("keep up the great work", "looking forward to using it"), you could
> come to that conclusion. You'd have to try really really hard.
>
> I wasn't criticising your buddy in any way, quite the opposite. Can we move
> on, now? There are better ways of spending our time.
>
>
>
>
>
>
>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-02-02  2:55 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-16 12:27 Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls Al
2016-01-16 14:10 ` Duncan
2016-01-16 18:07   ` Rich Freeman
2016-01-18 12:23     ` Austin S. Hemmelgarn
2016-01-23 22:22       ` Mark Fasheh
2016-01-20 14:49     ` Al
2016-01-20 14:43   ` Al
2016-01-21  8:23     ` Qu Wenruo
2016-01-21 14:53       ` Al
2016-01-21 17:23         ` Chris Murphy
2016-01-22 11:33           ` Al
2016-01-23  2:44             ` Chris Murphy
2016-02-02  2:55             ` Qu Wenruo
2016-01-18  1:36 ` Qu Wenruo
2016-01-18  3:10   ` Duncan
2016-01-18  3:16     ` Qu Wenruo
2016-01-18  3:51       ` Duncan
2016-01-18 12:48         ` Austin S. Hemmelgarn
2016-01-19  8:30           ` Duncan
2016-01-19  9:14             ` Duncan
2016-01-19 12:28               ` Austin S. Hemmelgarn
2016-01-19 15:40                 ` Duncan
2016-01-20  8:32                 ` Brendan Hide
2016-01-19 12:21             ` Austin S. Hemmelgarn
2016-01-20 15:12               ` Al
2016-01-20 18:21                 ` Duncan
2016-01-20 14:53   ` Al
