linux-kernel.vger.kernel.org archive mirror
* light weight write barriers
       [not found] <CALwJ=MzHjAOs4J4kGH6HLdwP8E88StDWyAPVumNg9zCWpS9Tdg@mail.gmail.com>
@ 2012-10-10 17:17 ` Andi Kleen
  2012-10-11 16:32   ` [sqlite] " 杨苏立 Yang Su Li
       [not found]   ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com>
  0 siblings, 2 replies; 58+ messages in thread
From: Andi Kleen @ 2012-10-10 17:17 UTC (permalink / raw)
  To: linux-kernel, sqlite-users, linux-fsdevel, drh

Richard Hipp writes:
>
> We would really, really love to have some kind of write-barrier that is
> lighter than fsync().  If there is some method other than fsync() for
> forcing a write-barrier on Linux that we don't know about, please enlighten
> us.

Could you list the requirements of such a light weight barrier?
i.e. what would it need to do minimally, what's different from
fsync/fdatasync ?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-10 17:17 ` light weight write barriers Andi Kleen
@ 2012-10-11 16:32   ` 杨苏立 Yang Su Li
  2012-10-11 17:41     ` Christoph Hellwig
  2012-10-23 19:53     ` Vladislav Bolkhovitin
       [not found]   ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com>
  1 sibling, 2 replies; 58+ messages in thread
From: 杨苏立 Yang Su Li @ 2012-10-11 16:32 UTC (permalink / raw)
  To: General Discussion of SQLite Database; +Cc: linux-kernel, linux-fsdevel, drh

I am not quite sure whether I should ask this question here, but in terms
of light weight barrier/fsync, could anyone tell me why the device
driver / OS provides a barrier interface rather than some other
abstraction? I am sorry if this sounds like a stupid question
or if it has been discussed before....

I mean, most of the time, we only need some ordering in writes; not
a complete order, but a partial, very simple topological order. And a
barrier seems to be a heavyweight solution to achieve this anyway:
you have to finish all writes before the barrier, then start all
writes issued after the barrier. That ordering is much
stronger than what we need, isn't it?

As most of the time the ordering we need does not involve too many blocks
(certainly far fewer than all the cached blocks in the system or in
the disk's cache), that topological order isn't likely to be very
complicated, and I imagine it could be implemented efficiently in a
modern device, which already has complicated caching/garbage
collection/whatever going on internally. In particular, it seems it would
not be too hard to implement on top of SCSI's ordered/simple task attributes?
(I believe Windows does this to an extent, but I am not quite sure.)

Thanks a lot

Suli


On Wed, Oct 10, 2012 at 12:17 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Richard Hipp writes:
>>
>> We would really, really love to have some kind of write-barrier that is
>> lighter than fsync().  If there is some method other than fsync() for
>> forcing a write-barrier on Linux that we don't know about, please enlighten
>> us.
>
> Could you list the requirements of such a light weight barrier?
> i.e. what would it need to do minimally, what's different from
> fsync/fdatasync ?
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]   ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com>
@ 2012-10-11 16:38     ` Nico Williams
  2012-10-11 16:48       ` Nico Williams
  0 siblings, 1 reply; 58+ messages in thread
From: Nico Williams @ 2012-10-11 16:38 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Andi Kleen, linux-fsdevel, linux-kernel, drh

On Wed, Oct 10, 2012 at 12:48 PM, Richard Hipp <drh@sqlite.org> wrote:
>> Could you list the requirements of such a light weight barrier?
>> i.e. what would it need to do minimally, what's different from
>> fsync/fdatasync ?
>
> For SQLite, the write barrier needs to involve two separate inodes.  The
> requirement is this:

...

> Note also that when fsync() works as advertised, SQLite transactions are
> ACID.  But when fsync() is reduced to a write-barrier, we lose the D
> (durable) and transactions are only ACI.  In our experience, nobody really
> cares very much about durability across a power loss.  People are mainly
> interested in Atomic, Consistent, and Isolated.  If you take a power loss
> and then after reboot you find the 10 seconds of work prior to the power
> loss is missing, nobody much cares about that as long as all of the prior
> work is still present and consistent.

There is something you can do: use a combination of a COW on-disk
format, arranged so that it's possible to detect partially-committed
transactions and roll back to the last known-good root, and
backgrounded fsync()s (i.e., fsync() run in a separate thread, without
waiting for it to complete).
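
A minimal sketch of the backgrounded-fsync() idea, assuming POSIX
threads; mark_transaction_durable() is a hypothetical callback the
database layer would supply, not an existing API:

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical callback: records that the transaction identified by
     * txid is now known to be on stable storage, so later transactions
     * may point to it as the last known-committed root. */
    extern void mark_transaction_durable(unsigned long txid);

    struct fsync_job {
        int fd;              /* file whose dirty data should be flushed */
        unsigned long txid;  /* transaction this flush belongs to */
    };

    static void *fsync_worker(void *arg)
    {
        struct fsync_job *job = arg;

        /* Nobody waits on this; writes for later transactions keep
         * going in the foreground while the flush is in flight. */
        if (fsync(job->fd) == 0)
            mark_transaction_durable(job->txid);
        free(job);
        return NULL;
    }

    /* Kick off an asynchronous flush for one transaction. */
    static int fsync_in_background(int fd, unsigned long txid)
    {
        pthread_t tid;
        struct fsync_job *job = malloc(sizeof(*job));

        if (!job)
            return -1;
        job->fd = fd;
        job->txid = txid;
        if (pthread_create(&tid, NULL, fsync_worker, job) != 0) {
            free(job);
            return -1;
        }
        return pthread_detach(tid);
    }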

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-11 16:38     ` Nico Williams
@ 2012-10-11 16:48       ` Nico Williams
  0 siblings, 0 replies; 58+ messages in thread
From: Nico Williams @ 2012-10-11 16:48 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Andi Kleen, linux-fsdevel, linux-kernel, drh

To expand a bit, the on-disk format needs to allow the roots of the
last N transactions to remain reachable at all times.  At open
time you look for the latest transaction, verify that it has been
written[0] completely, then use it; otherwise look for the preceding
transaction, verify it, and so on.
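
A sketch of that open-time scan, under an assumed on-disk layout with a
small fixed array of root slots; read_root_slot() and
verify_transaction() are stand-ins for whatever the real format would
provide:

    #include <stdint.h>

    /* Hypothetical root-slot layout: a fixed number of well-known
     * locations, each holding one recent transaction's root plus a
     * sequence number and a checksum over the slot. */
    struct txn_root {
        uint64_t seq;         /* monotonically increasing txn number */
        uint64_t root_block;  /* block address of the txn's root */
        uint64_t checksum;    /* covers seq and root_block */
    };

    #define N_ROOT_SLOTS 4

    /* Assumed helpers, not real APIs: read one slot, and verify that
     * the transaction it points to was written out completely (see
     * footnote [0] below). */
    extern int read_root_slot(int fd, int slot, struct txn_root *out);
    extern int verify_transaction(int fd, const struct txn_root *r);

    /* At open time, pick the newest root whose transaction verifies. */
    int find_latest_good_root(int fd, struct txn_root *best)
    {
        struct txn_root cur;
        int found = 0;

        for (int i = 0; i < N_ROOT_SLOTS; i++) {
            if (read_root_slot(fd, i, &cur) != 0)
                continue;
            if (!verify_transaction(fd, &cur))
                continue;                 /* partially written, skip */
            if (!found || cur.seq > best->seq) {
                *best = cur;
                found = 1;
            }
        }
        return found ? 0 : -1;            /* -1: no usable root found */
    }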

N needs to be at least 2: the last and the preceding transactions.  No
blocks should be freed or reused for any transactions still in use or
possibly in use (e.g., for power-failure recovery).  For high read
concurrency you can allow connections to lock a past transaction so
that no blocks are freed that are needed to access the DB at that
state.

This all goes back to 1980s DB and filesystem concepts.  See, for
example, the 4.4BSD Log-Structured Filesystem.  (I mention this in case
there are concerns about patents, though IANAL and I make no
particular assertions here other than that there is plenty of old
prior art and expired patents that can probably be used to obtain
sufficient certainty as to the patent law risks in the approach
described herein.)

[0] E.g., check a transaction block manifest and verify that those
blocks were written correctly; or traverse the tree looking for
differences from the previous transaction; this may require checking
block content checksums.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-11 16:32   ` [sqlite] " 杨苏立 Yang Su Li
@ 2012-10-11 17:41     ` Christoph Hellwig
  2012-10-23 19:53     ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2012-10-11 17:41 UTC (permalink / raw)
  To: 杨苏立 Yang Su Li
  Cc: General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

On Thu, Oct 11, 2012 at 11:32:27AM -0500, 杨苏立 Yang Su Li wrote:
> I am not quite sure whether I should ask this question here, but in terms
> of light weight barrier/fsync, could anyone tell me why the device
> driver / OS provides a barrier interface rather than some other
> abstraction? I am sorry if this sounds like a stupid question
> or if it has been discussed before....

It does not.  Except for the legacy mount option naming there is no such
thing as a barrier in Linux these days.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-11 16:32   ` [sqlite] " 杨苏立 Yang Su Li
  2012-10-11 17:41     ` Christoph Hellwig
@ 2012-10-23 19:53     ` Vladislav Bolkhovitin
  2012-10-24 21:17       ` Nico Williams
  2012-10-25  5:14       ` Theodore Ts'o
  1 sibling, 2 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-23 19:53 UTC (permalink / raw)
  To: 杨苏立 Yang Su Li
  Cc: General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:
> I am not quite sure whether I should ask this question here, but in terms
> of light weight barrier/fsync, could anyone tell me why the device
> driver / OS provides a barrier interface rather than some other
> abstraction? I am sorry if this sounds like a stupid question
> or if it has been discussed before....
>
> I mean, most of the time, we only need some ordering in writes; not
> a complete order, but a partial, very simple topological order. And a
> barrier seems to be a heavyweight solution to achieve this anyway:
> you have to finish all writes before the barrier, then start all
> writes issued after the barrier. That ordering is much
> stronger than what we need, isn't it?
>
> As most of the time the ordering we need does not involve too many blocks
> (certainly far fewer than all the cached blocks in the system or in
> the disk's cache), that topological order isn't likely to be very
> complicated, and I imagine it could be implemented efficiently in a
> modern device, which already has complicated caching/garbage
> collection/whatever going on internally. In particular, it seems it would
> not be too hard to implement on top of SCSI's ordered/simple task attributes?

Yes, SCSI has full support for ordered/simple commands designed exactly for that
task: to keep a steady flow of commands even when some of them are ordered.
It also has the necessary facilities to handle command errors without unexpected
reordering of subsequent commands (ACA, etc.). Those allow full storage
performance by fully "filling the pipe", in networking terms. I can easily imagine
real-life configurations where this brings 2+ times more performance than queue
flushing.

In fact, AFAIK, AIX requires storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports naturally
have the link as the point of serialization, so all you need in a multithreaded
environment is to carry a sequence number (SN) from the point where each ORDERED
command is created to the point where it is sent to the link, and make sure that
no SIMPLE commands can ever cross ORDERED commands. You can see how it is
implemented in SCST in an elegant and lockless manner (for SIMPLE commands).

But historically, for some reason, Linux storage developers were stuck with the
"barriers" concept, which is obviously not the same as ORDERED commands, and hence
had a lot of trouble with its ambiguous semantics. As far as I can tell the reason
for that was a lack of sufficiently deep SCSI understanding (how to handle errors,
the belief that ACA is a legacy of parallel SCSI times, etc.).

Hopefully, eventually the storage developers will realize the value behind ordered
commands and learn the corresponding SCSI facilities to deal with them. It's quite
easy to demonstrate this value, if you know where to look and do not blindly
refuse the possibility. I have already tried to explain it a couple of times,
but was not successful.

Before that happens, people will keep returning again and again with those simple
questions: why must the queue be flushed for any ordered operation? Isn't that
obvious overkill?

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-23 19:53     ` Vladislav Bolkhovitin
@ 2012-10-24 21:17       ` Nico Williams
  2012-10-24 22:03         ` david
  2012-10-27  1:52         ` Vladislav Bolkhovitin
  2012-10-25  5:14       ` Theodore Ts'o
  1 sibling, 2 replies; 58+ messages in thread
From: Nico Williams @ 2012-10-24 21:17 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Tue, Oct 23, 2012 at 2:53 PM, Vladislav Bolkhovitin
<vvvvvst@gmail.com> wrote:
>> As most of the time the ordering we need does not involve too many blocks
>> (certainly far fewer than all the cached blocks in the system or in
>> the disk's cache), that topological order isn't likely to be very
>> complicated, and I imagine it could be implemented efficiently in a
>> modern device, which already has complicated caching/garbage
>> collection/whatever going on internally. In particular, it seems it would
>> not be too hard to implement on top of SCSI's ordered/simple task attributes?

If you have multiple layers involved (e.g., SQLite then the
filesystem, and if the filesystem is spread over multiple storage
devices), and if transactions are not bounded, and on top of that if
there are other concurrent writers to the same filesystem (even if not
the same files) then the set of blocks to write and internal ordering
can get complex.  In practice filesystems try to break these up into
large self-consistent chunks and write those -- ZFS does this, for
example -- and this is aided by the lack of transactional semantics in
the filesystem.

For SQLite with a VFS that talks [i]SCSI directly then things could be
much more manageable as there's only one write transaction in progress
at any given time.  But that's not realistic, except, perhaps, in some
embedded systems.

> Yes, SCSI has full support for ordered/simple commands designed exactly for
> that task: [...]
>
> [...]
>
> But historically, for some reason, Linux storage developers were stuck with the
> "barriers" concept, which is obviously not the same as ORDERED commands, and
> hence had a lot of trouble with its ambiguous semantics. As far as I can tell
> the reason for that was a lack of sufficiently deep SCSI understanding
> (how to handle errors, the belief that ACA is a legacy of parallel
> SCSI times, etc.).

Barriers are a very simple abstraction, so there's that.

> Hopefully, eventually the storage developers will realize the value behind
> ordered commands and learn the corresponding SCSI facilities to deal with them.
> It's quite easy to demonstrate this value, if you know where to look and do
> not blindly refuse the possibility. I have already tried to explain it a
> couple of times, but was not successful.

Exposing ordering of lower-layer operations to filesystem applications
is a non-starter.  About the only reasonable thing to do with a
filesystem is add barrier operations.  I know, you're talking about
lower layer capabilities, and SQLite could talk to that layer
directly, but let's face it: it's not likely to.

> Before that happens, people will keep returning again and again with those
> simple questions: why must the queue be flushed for any ordered operation?
> Isn't that obvious overkill?

That [cache flushing] is not what's being asked for here.  Just a
light-weight barrier.  My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known uberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, e) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed.  This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-24 21:17       ` Nico Williams
@ 2012-10-24 22:03         ` david
  2012-10-25  0:20           ` Nico Williams
  2012-10-25  5:42           ` Theodore Ts'o
  2012-10-27  1:52         ` Vladislav Bolkhovitin
  1 sibling, 2 replies; 58+ messages in thread
From: david @ 2012-10-24 22:03 UTC (permalink / raw)
  To: Nico Williams
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, 24 Oct 2012, Nico Williams wrote:

>> Before that happens, people will keep returning again and again with those
>> simple questions: why must the queue be flushed for any ordered operation?
>> Isn't that obvious overkill?
>
> That [cache flushing] is not what's being asked for here.  Just a
> light-weight barrier.  My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known uberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, e) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed.  This obtains ACI- transaction semantics (survives power
> failures but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.

I'm doing some work with rsyslog and its disk-based queues, and there is a
similar issue there. The good news is that we can have a version that is
Linux-specific (rsyslog is used on other OSs, but there is an existing
queue implementation that they can use; if the faster one is Linux-only
but significantly faster, that's just a win for Linux).

Like what is being described for sqlite, losing the tail end of the
messages is not a big problem under normal conditions. But there is a need
to be sure that what is there is complete up to the point where it's lost.

This is similar in concept to the write-ahead logs done for databases (without
the absolute durability requirement).

1. new messages arrive and get added to the end of the queue file.

2. a thread updates the queue to indicate that it is in the process 
of delivering a block of messages

3. the thread updates the queue to indicate that the block of messages has 
been delivered

4. garbage collection happens to delete the old messages to free up space 
(if queues go into files, this can just be to limit the file size, 
spilling to multiple files, and when an old file is completely marked as 
delivered, delete it)
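
A rough sketch of steps 1-3, assuming two plain files and stdio; the
names and record formats here are illustrative, not rsyslog's actual
implementation:

    #include <stdio.h>

    /* 1. New message arrives: append it to the end of the queue file. */
    static int enqueue(FILE *queue, const char *msg)
    {
        if (fprintf(queue, "%s\n", msg) < 0)
            return -1;
        return fflush(queue);   /* hand it to the kernel; no fsync yet */
    }

    /* 2 and 3. The delivery thread appends progress records
     * ("delivering 17-21", later "delivered 17-21") to a small
     * state file. */
    static int mark_progress(FILE *state, const char *what,
                             long first, long last)
    {
        if (fprintf(state, "%s %ld-%ld\n", what, first, last) < 0)
            return -1;
        return fflush(state);
    }

Step 4 would then just unlink() an old queue file once the state file
records that every message in it was delivered. The ordering problem
discussed below is making sure a progress record never reaches the disk
before the messages it refers to.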

I don't fully understand how what you are describing (COW, separate
fsync threads, etc.) would be implemented on top of existing filesystems.
Most of what you are describing seems like it requires access to the
underlying storage to implement.

Could you give a more detailed explanation?

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-24 22:03         ` david
@ 2012-10-25  0:20           ` Nico Williams
  2012-10-25  1:04             ` david
  2012-10-25  5:42           ` Theodore Ts'o
  1 sibling, 1 reply; 58+ messages in thread
From: Nico Williams @ 2012-10-25  0:20 UTC (permalink / raw)
  To: david
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, Oct 24, 2012 at 5:03 PM,  <david@lang.hm> wrote:
> I'm doing some work with rsyslog and its disk-based queues, and there is a
> similar issue there. The good news is that we can have a version that is
> Linux-specific (rsyslog is used on other OSs, but there is an existing queue
> implementation that they can use; if the faster one is Linux-only but
> significantly faster, that's just a win for Linux).
>
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is a need
> to be sure that what is there is complete up to the point where it's lost.
>
> This is similar in concept to the write-ahead logs done for databases (without
> the absolute durability requirement).
>
> [...]
>
> I am not fully understanding how what you are describing (COW, separate
> fsync threads, etc) would be implemented on top of existing filesystems.
> Most of what you are describing seems like it requires access to the
> underlying storage to implement.
>
> Could you give a more detailed explanation?

COW is "copy on write", which is actually a bit of a misnomer -- all
COW means is that blocks aren't over-written, instead new blocks are
written.  In particular this means that inodes, indirect blocks, data
blocks, and so on, that are changed are actually written to new
locations, and the on-disk format needs to handle this indirection.

As for fsync() and background threads... fsync() is synchronous, but in
this scheme we want it to happen asynchronously and then we want to
update each transaction with a pointer to the last transaction that is
known stable given an fsync()'s return.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  0:20           ` Nico Williams
@ 2012-10-25  1:04             ` david
  2012-10-25  5:18               ` Nico Williams
  0 siblings, 1 reply; 58+ messages in thread
From: david @ 2012-10-25  1:04 UTC (permalink / raw)
  To: Nico Williams
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, 24 Oct 2012, Nico Williams wrote:

> On Wed, Oct 24, 2012 at 5:03 PM,  <david@lang.hm> wrote:
>> I'm doing some work with rsyslog and its disk-based queues, and there is a
>> similar issue there. The good news is that we can have a version that is
>> Linux-specific (rsyslog is used on other OSs, but there is an existing queue
>> implementation that they can use; if the faster one is Linux-only but
>> significantly faster, that's just a win for Linux).
>>
>> Like what is being described for sqlite, losing the tail end of the
>> messages is not a big problem under normal conditions. But there is a need
>> to be sure that what is there is complete up to the point where it's lost.
>>
>> This is similar in concept to the write-ahead logs done for databases (without
>> the absolute durability requirement).
>>
>> [...]
>>
>> I am not fully understanding how what you are describing (COW, separate
>> fsync threads, etc) would be implemented on top of existing filesystems.
>> Most of what you are describing seems like it requires access to the
>> underlying storage to implement.
>>
>> Could you give a more detailed explanation?
>
> COW is "copy on write", which is actually a bit of a misnomer -- all
> COW means is that blocks aren't over-written, instead new blocks are
> written.  In particular this means that inodes, indirect blocks, data
> blocks, and so on, that are changed are actually written to new
> locations, and the on-disk format needs to handle this indirection.

so how can you do this, and keep the writes in order (especially between 
two files) without being the filesystem?

> As for fsync() and background threads... fsync() is synchronous, but in
> this scheme we want it to happen asynchronously and then we want to
> update each transaction with a pointer to the last transaction that is
> known stable given an fsync()'s return.

If you could specify ordering between two writes, I could see a process
along the lines of:

Append the new message to file1.

Append tiny status updates to file2.

Every million messages, move to new files. Once the last message has been
processed for the old set of files, delete them.

Since file2 is small, you can reconstruct state fairly cheaply.

But unless you are the filesystem, how can you make sure that the message
data is written to file1 before you write the metadata about the message
to file2?

Right now it seems that there is no way for an application to do this
other than doing an fsync(file1) before writing the metadata to file2.

And there is no way for the application to tell the filesystem to write
the data in file2 in order (to make sure that block 3 is not written and
then the system crashes before block 2 is written), so the application
needs to do frequent fsync(file2) calls.
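
Spelled out as a sketch (the descriptor and buffer names are
hypothetical), today's only portable tool is exactly that hammer:
fsync() the data file before appending the metadata that refers to it:

    #include <sys/types.h>
    #include <unistd.h>

    /* Append a message to file1, then record its metadata in file2,
     * using fsync() as the only portable way to guarantee the order. */
    int append_with_ordering(int file1_fd, int file2_fd,
                             const void *msg, size_t msg_len,
                             const void *meta, size_t meta_len)
    {
        if (write(file1_fd, msg, msg_len) != (ssize_t)msg_len)
            return -1;

        /* Force the message to stable storage before the metadata
         * that refers to it can possibly reach the disk. */
        if (fsync(file1_fd) != 0)
            return -1;

        if (write(file2_fd, meta, meta_len) != (ssize_t)meta_len)
            return -1;

        /* A further fsync(file2_fd) would also order records within
         * file2 itself, at yet more cost. */
        return 0;
    }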

If you need complete durability of your data, there are well-documented
ways of enforcing it (including the lwn.net article
http://lwn.net/Articles/457667/ ).

But if you don't need the guarantee that your data is on disk now, and just
need it ordered so that if you crash you are guaranteed only to
lose data off the tail of your file, there doesn't seem to be any way
to do this other than using the fsync() hammer and waiting for the overhead
of forcing the data to disk now.


Or, as I type this, it occurs to me that you may be saying that every time
you want an ordering guarantee, spawn a new thread to do the fsync
and then just keep processing. The fsync will happen at some point, and
the writes will not be re-ordered across the fsync, but you can keep
going, writing more data while the fsyncs are pending.

Then if you have a filesystem and I/O subsystem that can consolidate the
fsyncs from all the different threads together into one I/O operation
without having to flush the entire I/O queue for each one, you can get
acceptable performance, with ordering. If the system crashes, data that
hasn't had its fsync() complete will be the only thing that is lost.

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-23 19:53     ` Vladislav Bolkhovitin
  2012-10-24 21:17       ` Nico Williams
@ 2012-10-25  5:14       ` Theodore Ts'o
  2012-10-25 13:03         ` Alan Cox
  2012-10-27  1:54         ` Vladislav Bolkhovitin
  1 sibling, 2 replies; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25  5:14 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
> Yes, SCSI has full support for ordered/simple commands designed
> exactly for that task: to keep a steady flow of commands even when
> some of them are ordered.....

SCSI does, yes --- *if* the device actually implements Tagged Command
Queuing (TCQ).  Not all devices do.

More importantly, SATA drives do *not* have this capability, and when
you compare the price of SATA drives to uber-expensive "enterprise
drives", it's not surprising that most people don't actually use
SCSI/SAS drives that have implemented TCQ.  SATA's Native Command
Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.

Yes, you can turn off writeback caching, but that has pretty huge
performance costs; and there is the FUA bit, but that's just an
unconditional high priority bypass of the writeback cache, which is
useful in some cases, but which again, does not give the ability for
the OS to specify a partial order, while letting the drive reorder
other requests for efficiency/performance's sake, since the drive has
a lot more information about the optimal way to reorder requests based
on the current location of the drive head and where certain blocks may
have been remapped due to bad block sparing, etc.

> Hopefully, eventually the storage developers will realize the value
> behind ordered commands and learn the corresponding SCSI facilities to
> deal with them.

Eventually, drive manufacturers will realize that trying to price
gouge people who want advanced features such as TCQ or DIF/DIX is the
best way to guarantee that most people won't bother to purchase them,
and hence the features will remain largely unused....

    	      	       	    	   	   - Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  1:04             ` david
@ 2012-10-25  5:18               ` Nico Williams
  2012-10-25  6:02                 ` Theodore Ts'o
  0 siblings, 1 reply; 58+ messages in thread
From: Nico Williams @ 2012-10-25  5:18 UTC (permalink / raw)
  To: david
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, Oct 24, 2012 at 8:04 PM,  <david@lang.hm> wrote:
> On Wed, 24 Oct 2012, Nico Williams wrote:
>> COW is "copy on write", which is actually a bit of a misnomer -- all
>> COW means is that blocks aren't over-written, instead new blocks are
>> written.  In particular this means that inodes, indirect blocks, data
>> blocks, and so on, that are changed are actually written to new
>> locations, and the on-disk format needs to handle this indirection.
>
> so how can you do this, and keep the writes in order (especially between two
> files) without being the filesystem?

By trusting fsync().  And if you don't care about immediate Durability
you can run the fsync() in a background thread and mark the associated
transaction as completed in the next transaction to be written after
the fsync() completes.

>> As for fsync() and background threads... fsync() is synchronous, but in
>> this scheme we want it to happen asynchronously and then we want to
>> update each transaction with a pointer to the last transaction that is
>> known stable given an fsync()'s return.
>
> If you could specify ordering between two writes, I could see a process
> along the lines of
>
> [...]

fsync() deals with just one file.  fsync()s of different files are
another story.  That said, as long as the format of the two files is
COW then you can still compose transactions involving two files.  The
key is that the file contents themselves must be COW-structured.

Incidentally, here's a single-file bag of B-trees that uses a COW
format: MDB, which can be found in
git://git.openldap.org/openldap.git, in the mdb.master branch.

> Or, as I type this, it occurs to me that you may be saying that every time
> you want to do an ordering guarantee, spawn a new thread to do the fsync and
> then just keep processing. The fsync will happen at some point, and the
> writes will not be re-ordered across the fsync, but you can keep going,
> writing more data while the fsync's are pending.

Yes, but only if the file's format is COWish.

The point is that COW saves the day.  A file-based DB needs to be COW,
and the filesystem needs to be as well.

Note that write-ahead logging approximates COW well enough most of the time.

> Then if you have a filesystem and I/O subsystem that can consolidate the
> fsyncs from all the different threads together into one I/O operation
> without having to flush the entire I/O queue for each one, you can get
> acceptable performance, with ordering. If the system crashes, data that
> hasn't had its fsync() complete will be the only thing that is lost.

With the above caveat, yes.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-24 22:03         ` david
  2012-10-25  0:20           ` Nico Williams
@ 2012-10-25  5:42           ` Theodore Ts'o
  2012-10-25  7:11             ` david
  1 sibling, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25  5:42 UTC (permalink / raw)
  To: david
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, Oct 24, 2012 at 03:03:00PM -0700, david@lang.hm wrote:
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is
> a need to be sure that what is there is complete up to the point
> where it's lost.
> 
> This is similar in concept to the write-ahead logs done for databases
> (without the absolute durability requirement).

If that's what you require, and you are using ext3/4, using data
journalling might meet your requirements.  It's something you can
enable on a per-file basis, via chattr +j; you don't have to force the
whole file system to use data journaling via the data=journal mount
option.
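
For reference, the programmatic equivalent of chattr +j is setting the
per-file journal-data flag with the FS_IOC_SETFLAGS ioctl. A hedged
sketch (requires an ext3/4 file and sufficient privileges; other
filesystems will reject the flag):

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Enable per-file data journalling (equivalent to "chattr +j"). */
    int enable_data_journalling(const char *path)
    {
        int attr;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0) {
            close(fd);
            return -1;
        }
        attr |= FS_JOURNAL_DATA_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }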

The potential downsides that you may or may not care about for this
particular application:

(a) This will definitely have a performance impact, especially if you
are doing lots of small (less than 4k) writes, since the data blocks
will get run through the journal, and will only later get written to their
final location on disk.

(b) You don't get atomicity if the write spans a 4k block boundary.
All of the bytes before i_size will be written, so you don't have to
worry about "holes"; but the last message written to the log file
might be truncated.

(c) There will be a performance impact, since the contents of data
blocks will be written at least twice (once to the journal, and once
to the final location on disk).  If you do lots of small, sub-4k
writes, the performance might be even worse, since data blocks might
be written multiple times to the journal.

						- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  5:18               ` Nico Williams
@ 2012-10-25  6:02                 ` Theodore Ts'o
  2012-10-25  6:58                   ` david
  2012-10-30 23:49                   ` Nico Williams
  0 siblings, 2 replies; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25  6:02 UTC (permalink / raw)
  To: Nico Williams
  Cc: david, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
> 
> By trusting fsync().  And if you don't care about immediate Durability
> you can run the fsync() in a background thread and mark the associated
> transaction as completed in the next transaction to be written after
> the fsync() completes.

The challenge is when you have entangled metadata updates.  That is,
you update file A, and file B, and file A and B might share metadata.
In order to sync file A, you also have to update part of the metadata
for the updates to file B, which means calculating the dependencies of
what you have to drag in can get very complicated.  You can keep track
of what bits of the metadata you have to undo and then redo before
writing out the metadata for fsync(A), but that basically means you
have to implement soft updates, and all of the complexity this
implies: http://lwn.net/Articles/339337/

If you can keep all of the metadata separate, this can be somewhat
mitigated, but usually the block allocation records (regardless of
whether you use a tree, or a bitmap, or some other data structure)
tend to have entanglement problems.

It certainly is not impossible; RDBMS's have implemented this.  On the
other hand, they generally aren't as fast as file systems for
non-transactional workloads, and people really care about performance
on those sorts of workloads for file systems.  (About a decade ago,
Oracle tried to claim that you could run file system workloads using
an Oracle database as a back-end.  Everyone laughed at them, and the
idea died a quick, merciful death.)

Still, if you want to try to implement such a thing, by all means,
give it a try.  But I think you'll find that creating a file system
that can compete with existing file systems for performance, and
*then* also supports a transactional model, is going to be quite a
challenge.

     	      		      	     	      	 - Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  6:02                 ` Theodore Ts'o
@ 2012-10-25  6:58                   ` david
  2012-10-25 14:03                     ` Theodore Ts'o
  2012-10-30 23:49                   ` Nico Williams
  1 sibling, 1 reply; 58+ messages in thread
From: david @ 2012-10-25  6:58 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>>
>> By trusting fsync().  And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.
>
> The challenge is when you have entangled metadata updates.  That is,
> you update file A, and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated.  You can keep track
> of what bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, and all of the complexity this
> implies: http://lwn.net/Articles/339337/
>
> If you can keep all of the metadata separate, this can be somewhat
> mitigated, but usually the block allocation records (regardless of
> whether you use a tree, or a bitmap, or some other data structure)
> tend to have entanglement problems.

hmm, two thoughts occur to me.

1. to avoid entanglement, put the two files in separate directories

2. take advantage of entanglement to enforce ordering


thread 1 (repeated): write new message to file 1, spawn new thread to 
fsync

thread 2: write to file 2 that messages 1-5 are being worked on

thread 2 (later): write to file 2 that messages 1-5 are done

When thread 1 spawns the new thread to do the fsync, the system will be
forced to write the data to file 2 as of the time it does the fsync.

This should make it so that you never have data written to file2 that 
refers to data that hasn't been written to file1 yet.


> It certainly is not impossible; RDBMS's have implemented this.  On the
> other hand, they generally aren't as fast as file systems for
> non-transactional workloads, and people really care about performance
> on those sorts of workloads for file systems.

The RDBMSs have implemented stronger guarantees than what we need.

A few years ago I was investigating this for logging. With the reliable
(RDBMS-style), but inefficient, disk queue that rsyslog has, writing to a
high-end Fusion-io SSD, ext2 resulted in ~8K logs/sec, ext3 resulted in ~2K
logs/sec, and JFS/XFS resulted in ~4K logs/sec (ext4 wasn't considered
stable enough at the time to be tested).

> Still, if you want to try to implement such a thing, by all means,
> give it a try.  But I think you'll find that creating a file system
> that can compete with existing file systems for performance, and
> *then* also supports a transactional model, is going to be quite a
> challenge.

The question is trying to figure out a way to get ordering right with existing
filesystems (preferably without using something too tied to a single
filesystem implementation), not to try to create a new one.

The frustrating thing is that when people point out how things like sqlite
are so horribly slow, the reply seems to be "well, that's what you get for
doing so many fsyncs, don't do that", while when there is a 'problem' like the
KDE "config loss" problem a few years ago, the response is "well, that's
what you get for not doing fsync".

Both responses are correct, from a purely technical point of view.

But what's missing is any way to get the result of ordered I/O that will 
let you do something pretty fast, but with the guarantee that, if you 
lose data in a crash, the only loss you are risking is that your most
recent data may be missing. (either for one file, or using multiple files 
if that's what it takes)

Since this topic came up again, I figured I'd poke a bit and try to either 
get educated on how to do this "right" or try and see if there's something 
that could be added to the kernel to make it possible for userspace 
programs to do this.

What I think userspace really needs is something like a barrier function 
call. "for this fd, don't re-order writes as they go down through the 
stack"

If the hardware is going to reorder things once requests hit it, this
is going to hurt performance (how much depends on a lot of stuff).

But the filesystems are able to make their journals work, so there should
be some way to let userspace do some sort of similar ordering.

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  5:42           ` Theodore Ts'o
@ 2012-10-25  7:11             ` david
  0 siblings, 0 replies; 58+ messages in thread
From: david @ 2012-10-25  7:11 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> On Wed, Oct 24, 2012 at 03:03:00PM -0700, david@lang.hm wrote:
>> Like what is being described for sqlite, losing the tail end of the
>> messages is not a big problem under normal conditions. But there is
>> a need to be sure that what is there is complete up to the point
>> where it's lost.
>>
>> This is similar in concept to the write-ahead logs done for databases
>> (without the absolute durability requirement)
>
> If that's what you require, and you are using ext3/4, using data
> journalling might meet your requirements.  It's something you can
> enable on a per-file basis, via chattr +j; you don't have to force the
> whole file system to use data journaling via the data=journal mount
> option.
>
> The potential downsides that you may or may not care about for this
> particular application:
>
> (a) This will definitely have a performance impact, especially if you
> are doing lots of small (less than 4k) writes, since the data blocks
> will get run through the journal, and will only get written to their
> final location on disk.
>
> (b) You don't get atomicity if the write spans a 4k block boundary.
> All of the bytes before i_size will be written, so you don't have to
> worry about "holes"; but the last message written to the log file
> might be truncated.
>
> (c) There will be a performance impact, since the contents of data
> blocks will be written at least twice (once to the journal, and once
> to the final location on disk).  If you do lots of small, sub-4k
> writes, the performance might be even worse, since data blocks might
> be written multiple times to the journal.

I'll have to dig into this option. In the case of rsyslog it sounds
like it could work (not as good as a filesystem-independent way of doing
things, but better than full fsyncs).

Truncated messages are not great, but they are a detectable, and 
acceptable risk.

while the average message size is much smaller than 4K (on my network it's 
~250 bytes), the metadata that's broken out expands this somewhat, and we 
can afford to waste disk space if it makes things safer or more efficient.

If we update flags in place with each message, each message will
need to be written up to three times (on receipt, being processed, finished
processing). With high message burst rates, I'm worried that we would fill
up the journal; is there a good way to deal with this?

I believe that ext4 can put the journal on a different device from the
filesystem; would this help a lot?

If you were to put the journal for an ext4 filesystem on a RAM disk, you
would lose the data recovery protection of the journal, but could you use
this trick to get ordered data writes onto the filesystem?

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  5:14       ` Theodore Ts'o
@ 2012-10-25 13:03         ` Alan Cox
  2012-10-25 13:50           ` Theodore Ts'o
  2012-10-27  1:54         ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 58+ messages in thread
From: Alan Cox @ 2012-10-25 13:03 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Vladislav Bolkhovitin, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

> > Hopefully, eventually the storage developers will realize the value
> > behind ordered commands and learn the corresponding SCSI facilities to
> > deal with them.
> 
> Eventually, drive manufacturers will realize that trying to price
> gouge people who want advanced features such as TCQ or DIF/DIX is the
> best way to guarantee that most people won't bother to purchase them,
> and hence the features will remain largely unused....

I doubt they care. The profit on high end features from the people who
really need them I would bet far exceeds any other benefit of giving it to
others. Welcome to capitalism 8)

Plus - spinning rust for those end users is on the way out, SATA to flash
is a bit of hack and people are already putting a lot of focus onto
things like NVM Express.

Alan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25 13:03         ` Alan Cox
@ 2012-10-25 13:50           ` Theodore Ts'o
  2012-10-27  1:55             ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 13:50 UTC (permalink / raw)
  To: Alan Cox
  Cc: Vladislav Bolkhovitin, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

On Thu, Oct 25, 2012 at 02:03:25PM +0100, Alan Cox wrote:
> 
> I doubt they care. The profit on high end features from the people who
> really need them I would bet far exceeds any other benefit of giving it to
> others. Welcome to capitalism 8)

Yes, but it's a question of pricing.  If they had priced it at just a
wee bit higher, then there would have been an incentive to add support
for TCQ so it could actually be used in various Linux file systems,
since there would have been lots of users of it.  But as it is, the
folks who are purchasing huge, vast numbers of these drives --- such as
at the large cloud providers: Amazon, Facebook, Rackspace, et al. ---
will choose to purchase large numbers of commodity drives, and then
find ways to work around the missing functionality in userspace.  For
example, DIF/DIX would be nice, and if it were available for cheap, I
could imagine it being used.  But you can accomplish the same thing in
userspace, and in fact at Google I've implemented a special
not-for-mainline patch which spikes out stable writes (required for
DIF/DIX) because it has significant performance overhead, and DIF/DIX
has zero benefit if you're not willing to shell out $$$ for hardware
that supports it.

Maybe the HDD manufacturers have been able to price gouge a small
number of enterprise I/T shops with more dollars than sense, but
personally, I'm not convinced they picked an optimal pricing
strategy....

Put another way, I accept that Toyota should price a Lexus ES higher
than a Camry, but if it's priced at, say, 3x the price of a Camry
instead of 20% more, they might find that precious few people are willing
to pay that kind of money for what is essentially the same car with
minor luxury tweaks added to it.

> Plus - spinning rust for those end users is on the way out, SATA to flash
> is a bit of hack and people are already putting a lot of focus onto
> things like NVM Express.

Yeah....  I don't buy that.  One, flash is still too expensive.  Two,
the capital costs to build enough Silicon foundries to replace the
current production volume of HDD's is way too expensive for any
company to afford (the cloud providers are buying *huge* numbers of
HDD's) --- and that's assuming companies wouldn't chose to use those
foundries for products with larger margins --- such as, for example,
CPU/GPU chips. :-) And third and finally, if you study the long-term
trends in terms of Data Retention Time (going down), Program and Read
Disturb (going up), and Write Endurance (going down) as a function of
feature size and/or time, you'd be wise to treat flash as nothing more
than short-term cache, and not as a long term stable store.

If end users completely give up on hard drives, and store all of their
precious family pictures on flash storage, after a couple of years
they are likely going to be very disappointed....

Speaking personally, I wouldn't want to have anything on flash for
more than a few months at *most* before I made sure I had another copy
saved on spinning rust platters for long-term retention.

      	 	       		    	      - Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  6:58                   ` david
@ 2012-10-25 14:03                     ` Theodore Ts'o
  2012-10-25 18:03                       ` david
  0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 14:03 UTC (permalink / raw)
  To: david
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, Oct 24, 2012 at 11:58:49PM -0700, david@lang.hm wrote:
> The frustrating thing is that when people point out how things like
> sqlite are so horribly slow, the reply seems to be "well, that's
> what you get for doing so many fsyncs, don't do that", when there is
> a 'problem' like the KDE "config loss" problem a few years ago, the
> response is "well, that's what you get for not doing fsync"

Sure... but the answer is to only do the fsyncs when you need to.
For example, if GNOME and KDE are rewriting the entire registry file
each time the application changes a single registry key, sure, if
you rewrite the entire registry file, and then fsync after each
rewrite before you replace the file, you will be safe.  And if the
application needs to update dozens or hundreds of registry keys (or
does this every time the window gets moved or resized), then yes, it will be
slow.  But the application didn't have to do that!  It could have
updated all the registry keys in memory, and then updated the registry
file periodically instead.

Similarly, Firefox didn't need to do a sqlite commit after every
single time its history file was written, causing a third of a
megabyte of write traffic each time you clicked on a web page.  It
could have batched its updates to the history file, since most of the
time, you don't care about making sure the web history is written to
stable store before you're allowed to click on a web page and visit
the next web page.

Or does rsyslog *really* need to issue an fsync after each log
message?  Or could it batch updates so that every N seconds, it
flushes writes to the disk?
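
That batching can be sketched as follows; struct record and
next_message() are assumptions standing in for the daemon's real input
path:

    #include <time.h>
    #include <unistd.h>

    struct record { const char *buf; size_t len; };

    /* Assumed helper: block briefly and return the next record to
     * append, or NULL if nothing arrived. */
    extern struct record *next_message(void);

    #define FLUSH_INTERVAL 5   /* seconds between forced flushes */

    /* Append continuously, but fsync() at most every few seconds so
     * one flush covers a whole batch of messages.  (A real daemon
     * would also flush on shutdown and on explicit request.) */
    void log_loop(int log_fd)
    {
        time_t last_flush = time(NULL);
        struct record *r;

        for (;;) {
            r = next_message();
            if (r)
                write(log_fd, r->buf, r->len); /* error checks omitted */

            if (time(NULL) - last_flush >= FLUSH_INTERVAL) {
                fsync(log_fd);   /* one flush covers the batch */
                last_flush = time(NULL);
            }
        }
    }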

(And this is a problem with most Android applications as well.
Apparently the framework API's are such that it's easier for an
application to treat each sqlite statement as an atomic update, so
many/most application writers don't use explicit transaction
boundaries, so updates don't get batched even though it would be more
efficient if they did so.)

Sometimes, the answer is not to try to create exotic database-like
functionality in the file system --- the answer is to be more
intelligent at the application layer.  Not only will the application
be more portable, it will also in the end be more efficient, since
even with the most exotic database technologies, the most efficient
transactional commit is the unneeded commit that you optimize away at
the application layer.

						- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25 14:03                     ` Theodore Ts'o
@ 2012-10-25 18:03                       ` david
  2012-10-25 18:29                         ` Theodore Ts'o
  0 siblings, 1 reply; 58+ messages in thread
From: david @ 2012-10-25 18:03 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> Or does rsyslog *really* need to issue an fsync after each log
> message?  Or could it batch updates so that every N seconds, it
> flushes writes to the disk?

In part this depends on how paranoid the admin is. By default rsyslog 
doesn't do fsyncs, but admins can configure it to do so and can configure 
the batch size.

However, what I'm talking about here is not normal message traffic, it's 
the case where the admin has decided that they don't want to use the 
normal inmemory queues, they want to have the queues be on disk so that if 
the system crashes the queued data will still be there to be processed 
after the crash (In addition, this can get used to cover cases where you 
want queue sizes larger than your available RAM)

In this case, the extreme, and only at the explicit direction of the 
admin, is to fsync after every message.

The norm is that it's acceptable to lose the last few messages, but
losing a chunk out of the middle of the queue file can cause a whole lot
more to be lost, passing the threshold of acceptability.

> Sometimes, the answer is not to try to create exotic database-like
> functionality in the file system --- the answer is to be more
> intelligent at the application layer.  Not only will the application
> be more portable, it will also in the end be more efficient, since
> even with the most exotic database technologies, the most efficient
> transactional commit is the unneeded commit that you optimize away at
> the application layer.

I agree, this is why I'm trying to figure out the recommended way to do 
this without needing to do full commits.

Since in most cases it's acceptable to lose the last few chunks written,
if we had some way of specifying ordering, without having to specify 
"write this NOW", the solution would be pretty obvious.

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25 18:03                       ` david
@ 2012-10-25 18:29                         ` Theodore Ts'o
  2012-11-05 20:03                           ` Pavel Machek
  0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 18:29 UTC (permalink / raw)
  To: david
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, Oct 25, 2012 at 11:03:13AM -0700, david@lang.hm wrote:
> I agree, this is why I'm trying to figure out the recommended way to
> do this without needing to do full commits.
> 
> Since in most cases it's acceptable to lose the last few chunks
> written, if we had some way of specifying ordering, without having
> to specify "write this NOW", the solution would be pretty obvious.

Well, using data journalling with ext3/4 may do what you want.  If you
don't do any fsync, the changes will get written every 5 seconds when
the automatic journal sync happens (and sub-4k writes will also get
coalesced to a 5 second granularity).  Even with plain text files,
it's pretty easy to tell whether or not the final record was
partially written after a crash; just look for a trailing
newline.
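
That post-crash check is only a few lines; a sketch:

    #include <fcntl.h>
    #include <unistd.h>

    /* Returns 1 if the last record in a newline-terminated log file is
     * complete, 0 if it was only partially written, -1 on error or an
     * empty file. */
    int last_record_complete(const char *path)
    {
        char c;
        off_t end;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        end = lseek(fd, 0, SEEK_END);
        if (end <= 0 || pread(fd, &c, 1, end - 1) != 1) {
            close(fd);
            return -1;
        }
        close(fd);
        return c == '\n';
    }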

Better yet, if you are writing to multiple log files with data
journalling, all of the writes will happen at the same time, and they
will be streamed to the file system journal, minimizing random writes
for at least the journal writes.

						- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-24 21:17       ` Nico Williams
  2012-10-24 22:03         ` david
@ 2012-10-27  1:52         ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27  1:52 UTC (permalink / raw)
  To: Nico Williams
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh


Nico Williams, on 10/24/2012 05:17 PM wrote:
>> Yes, SCSI has full support for ordered/simple commands designed exactly for
>> that task: [...]
>>
>> [...]
>>
>> But historically, for some reason, Linux storage developers were stuck with the
>> "barriers" concept, which is obviously not the same as ORDERED commands, and
>> hence had a lot of trouble with its ambiguous semantics. As far as I can tell
>> the reason for that was a lack of sufficiently deep SCSI understanding
>> (how to handle errors, the belief that ACA is a legacy of parallel
>> SCSI times, etc.).
>
> Barriers are a very simple abstraction, so there's that.

It isn't simple at all. If you think for some time about barriers from the storage 
point of view, you will soon realize how bad and ambiguous they are.

>> Before that happens, people will keep returning again and again with those
>> simple questions: why the queue must be flushed for any ordered operation?
>> Isn't is an obvious overkill?
>
> That [cache flushing]

It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if 
you like.

Often there's a big difference where it's done: on the system side, or on the 
storage side.

Actually, performance improvements from NCQ in many cases come not from letting 
the drive reorder requests, as is commonly thought, but from keeping the drive's 
internal processing stages always busy without any idle time. Drives often have a 
long internal pipeline. Hence the need to keep every stage of it always busy, and 
hence why using ORDERED commands is important for performance.

> is not what's being asked for here. Just a
> light-weight barrier.  My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known uberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, e) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed.  This obtains ACI- transaction semantics (survives power
> failures but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.

I believe what you really want is to be able to send to the storage a sequence of 
your favorite operations (FS operations, async IO operations, etc.) like:

Write back caching disabled:

data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write back caching enabled:

data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, 
..., data op2M, ...

Right?

(ORDERED means that it is guaranteed that this ordered command will never, under 
any circumstances, be executed before any previous command has completed, nor after 
any subsequent command has completed.)

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  5:14       ` Theodore Ts'o
  2012-10-25 13:03         ` Alan Cox
@ 2012-10-27  1:54         ` Vladislav Bolkhovitin
  2012-10-27  4:44           ` Theodore Ts'o
       [not found]           ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com>
  1 sibling, 2 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27  1:54 UTC (permalink / raw)
  To: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh


Theodore Ts'o, on 10/25/2012 01:14 AM wrote:
> On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
>> Yes, SCSI has full support for ordered/simple commands designed
>> exactly for that task: to have steady flow of commands even in case
>> when some of them are ordered.....
>
> SCSI does, yes --- *if* the device actually implements Tagged Command
> Queuing (TCQ).  Not all devices do.
>
> More importantly, SATA drives do *not* have this capability, and when
> you compare the price of SATA drives to uber-expensive "enterprise
> drives", it's not surprising that most people don't actually use
> SCSI/SAS drives that have implemented TCQ.

What is different in our positions is that you consider storage to be something 
you can connect to your desktop, while in my view storage is something which 
stores data and serves it in the best possible way with the best performance.

Hence, for you the least common denominator of all storage features is the most 
important thing, while for me getting the best out of what storage can offer is 
the most important.

In my view storage should offload from the host system as much as possible: data 
movements, ordered operations requirements, atomic operations, deduplication, 
snapshots, reliability measures (eg RAIDs), load balancing, etc.

It's the same as with 2D/3D video acceleration hardware. If you want the best 
performance from your system, you should offload from it as much as possible: in 
the case of video, to the video hardware; in the case of storage, to the storage. 
As with video, for storage better offload means better performance. At hundreds 
of thousands of IOPS it's clearly visible.

Price doesn't matter here, because it's a completely different topic.

> SATA's Native Command
> Queuing (NCQ) is not equivalent; this allows the drive to reorder
> requests (in particular read requests) so they can be serviced more
> efficiently, but it does *not* allow the OS to specify a partial,
> relative ordering of requests.

And so? If SATA can't do it, does it mean that nobody else can do it either? I know 
plenty of non-SATA devices which can meet the ordering requirements you need.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25 13:50           ` Theodore Ts'o
@ 2012-10-27  1:55             ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27  1:55 UTC (permalink / raw)
  To: Theodore Ts'o, Alan Cox, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh


Theodore Ts'o, on 10/25/2012 09:50 AM wrote:
> Yeah....  I don't buy that.  One, flash is still too expensive.  Two,
> the capital costs to build enough Silicon foundries to replace the
> current production volume of HDD's is way too expensive for any
> company to afford (the cloud providers are buying *huge* numbers of
> HDD's) --- and that's assuming companies wouldn't chose to use those
> foundries for products with larger margins --- such as, for example,
> CPU/GPU chips. :-) And third and finally, if you study the long-term
> trends in terms of Data Retention Time (going down), Program and Read
> Disturb (going up), and Write Endurance (going down) as a function of
> feature size and/or time, you'd be wise to treat flash as nothing more
> than short-term cache, and not as a long term stable store.
>
> If end users completely give up on flash, and store all of their
> precious family pictures on flash storage, after a couple of years,
> they are likely going to be very disappointed....
>
> Speaking personally, I wouldn't want to have anything on flash for
> more than a few months at *most* before I made sure I had another copy
> saved on spinning rust platters for long-term retention.

Here I agree with you.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-27  1:54         ` Vladislav Bolkhovitin
@ 2012-10-27  4:44           ` Theodore Ts'o
  2012-10-30 22:22             ` Vladislav Bolkhovitin
       [not found]           ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com>
  1 sibling, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-27  4:44 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
> What different in our positions is that you are considering storage
> as something you can connect to your desktop, while in my view
> storage is something, which stores data and serves them the best
> possible way with the best performance.

I don't get paid to make Linux storage work well for gold-plated
storage, and as far as I know, none of the purveyors of said gold
plated software systems are currently employing Linux file system
developers to make Linux file systems work well on said gold-plated
hardware.

As for what I might do on my own time, for fun, I can't afford said
gold-plated hardware, and personally I get a lot more satisfaction if
I know there will be a large number of people who benefit from my work
(it was really cool when I found out that millions and millions of
Android devices were going to be using ext4 :-), as opposed to a very
small number of people who have paid $$$ to storage vendors who don't
feel it's worthwhile to pay core Linux file system developers to
leverage their hardware.  Earlier, you were bemoaning why Linux file
system developers weren't paying attention to using said fancy SCSI
features.  Perhaps now you'll understand better why it's not happening?

> Price doesn't matter here, because it's completely different topic.

It matters if you think I'm going to do it on my own time, out of my
own budget.  And if you think my employer is going to choose to use
said hardware, price definitely matters.  I consider engineering to be
the art of making tradeoffs, and price is absolutely one of the things
that we need to trade off against other goals.

It's rare that you get to design something where performance matters
above all else.  Maybe it's that way if you're paid by folks whose job
it is to destabilize the world's financial markets by pushing the holes
into the right half plane (i.e., high frequency trading :-).  But for
the rest of the world, price absolutely matters.

     	    	   	 	    - Ted

P.S.  All of the storage I have access to at home is SATA.  If someone
would like to change that and ship me free hardware, as long as it
doesn't require three-phase power (or require some exotic interconnect
which is ghastly expensive and which you are also not going to provide
me for free), do contact me off-line.  :-)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-27  4:44           ` Theodore Ts'o
@ 2012-10-30 22:22             ` Vladislav Bolkhovitin
  2012-10-31  9:54               ` Alan Cox
  0 siblings, 1 reply; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-30 22:22 UTC (permalink / raw)
  To: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh


Theodore Ts'o, on 10/27/2012 12:44 AM wrote:
> On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
>> What different in our positions is that you are considering storage
>> as something you can connect to your desktop, while in my view
>> storage is something, which stores data and serves them the best
>> possible way with the best performance.
>
> I don't get paid to make Linux storage work well for gold-plated
> storage, and as far as I know, none of the purveyors of said gold
> plated software systems are currently employing Linux file system
> developers to make Linux file systems work well on said gold-plated
> hardware.

I don't want to flame on this topic, but you are not right here. As far as I can 
see, a big chunk of Linux storage and file system developers are/were employed by 
the "gold-plated storage" manufacturers, starting with FusionIO, SGI and Oracle.

You know, RedHat has also recently stepped into this market; at least I saw 
their advertisement at SDC 2012. So you can add all RedHat employees here as well.

> As for what I might do on my own time, for fun, I can't afford said
> gold-plated hardware, and personally I get a lot more satisfaction if
> I know there will be a large number of people who benefit from my work
> (it was really cool when I found out that millions and millions of
> Android devices were going to be using ext4 :-), as opposed to a very
> small number of people who have paid $$$ to storage vendors who don't
> feel it's worthwhile to pay core Linux file system developers to
> leverage their hardware.  Earlier, you were bemoaning why Linux file
> system developers weren't paying attention to using said fancy SCSI
> features.  Perhaps now you'll understand better it's not happening?
>
>> Price doesn't matter here, because it's completely different topic.
>
> It matters if you think I'm going to do it on my own time, out of my
> own budget.  And if you think my employer is going to choose to use
> said hardware, price definitely matters.  I consider engineering to be
> the art of making tradeoffs, and price is absolutely one of the things
> that we need to trade off against other goals.
>
> It's rare that you get to design something where performance matters
> above all else.  Maybe it's that way if you're paid by folks whose job
> it is to destabilize the world's financial markets by pushing the holes
> into the right half plane (i.e., high frequency trading :-).  But for
> the rest of the world, price absolutely matters.

I fully understand your position. But "affordable" and "useful" are completely 
orthogonal things. The "high end" features are very useful if you want to get 
high performance. Those who can afford them will use them, which might be 
your favorite bank, for instance, and hence they will be indirectly working for you.

Of course, you don't have to work on those features, especially for free, but you 
similarly don't then have to call them useless only because they are not 
affordable enough to be put in a desktop [1].

Our discussion started not from "value-for-money", but from a constant demand to 
perform ordered commands without full queue draining, which has been ignored by the 
Linux storage developers for YEARS as not useful, right?

Vlad

[1] If you or somebody else wants to put something supporting all the necessary 
features to perform ORDERED commands, including ACA, in a desktop, you can look at 
modern SAS SSDs. I can't call the price of those devices "high-end".



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  6:02                 ` Theodore Ts'o
  2012-10-25  6:58                   ` david
@ 2012-10-30 23:49                   ` Nico Williams
  1 sibling, 0 replies; 58+ messages in thread
From: Nico Williams @ 2012-10-30 23:49 UTC (permalink / raw)
  To: Theodore Ts'o, Nico Williams, david,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel

[Dropping sqlite-users.  Note that I'm not subscribed to any of the
other lists cc'ed.]

On Thu, Oct 25, 2012 at 1:02 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>>
>> By trusting fsync().  And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.

You are all missing some context which I would have added had I
noticed the cc'ing of additional lists.

D.R. Hipp asked for a light-weight barrier API from the OS/filesystem,
the SQLite use-case being to implement fast ACI_ semantics, without
durability (i.e., that it be OK to lose the last few transactions, but
not to end up with a corrupt DB, and maintaining atomicity,
consistency, and isolation).

I noted that with a journalled/COW DB file format[0] one could run an
fsync() in a "background" thread to act as a barrier, and then note in
each transaction the last preceding transaction known to have reached
disk (because fsync() returned and the bg thread marked the
transaction in question as durable).  Then refrain from garbage
collecting any transactions not marked as durable.  Now, there are
some caveats, the main one being that this fails if the filesystem or
hardware lie about fsync() / cache flushes.  Other caveats include
that fsync() used this way can have more impact on filesystem
performance than a true light-weight barrier[1], that the filesystem
itself might not be powerfail-safe, and maybe a few others.  But the
point is that fsync() can be used in such a way that one need not wait
for a transaction to reach rotating rust stably and still retain
powerfail safety without durability for the last few transactions.
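
Roughly, the arrangement I have in mind looks like this (a sketch only; the
engine hooks named below are hypothetical placeholders, not SQLite or ZFS
code):

#include <pthread.h>
#include <unistd.h>

/* Hypothetical hooks into a COW database engine -- placeholders only. */
extern unsigned long newest_committed_txn(void);
extern void write_txn_root(unsigned long txn_id, unsigned long last_durable);

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
static unsigned long last_durable;  /* newest txn a completed fsync() covers */

/* Background thread: fsync() the DB file and record which transactions
 * were already committed in memory when the fsync was issued. */
static void *fsync_thread(void *fdp)
{
        int db_fd = *(int *)fdp;

        for (;;) {
                unsigned long covered = newest_committed_txn();

                if (fsync(db_fd) == 0) {
                        pthread_mutex_lock(&lk);
                        if (covered > last_durable)
                                last_durable = covered;
                        pthread_mutex_unlock(&lk);
                }
                usleep(100 * 1000);     /* pace the background syncs */
        }
        return NULL;
}

/* Foreground commit: note the last known-durable transaction in the new
 * root block; space of anything newer must not be reclaimed yet. */
static void commit_txn(unsigned long txn_id)
{
        pthread_mutex_lock(&lk);
        unsigned long known = last_durable;
        pthread_mutex_unlock(&lk);

        write_txn_root(txn_id, known);
}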

[0] Like the BSD4.4 log structured filesystem, ZFS, Howard Chu's MDB,
and many others.  Note that ZFS has a pool-import time option to
recover from power failures by ignoring any not completely verifiable
transactions and rolling back to the last verifiable one.

[1] Think of what ZFS does when there's no ZIL and an fsync() comes
along: ZFS will either block the fsync() thread until the current
transaction closes or else close the current transaction and possibly
write a much smaller transaction, thus losing out on making writes as
large and contiguous as possible.

> The challenge is when you have entagled metadata updates.  That is,
> you update file A, and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated.  You can keep track
> of what bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, and all of the complexity this
> implies: http://lwn.net/Articles/339337/

I believe that my suggestion composes for multi-file DB file formats,
as long as the sum total forms a COWish on-disk format.  Of course,
adding more fsync()s, even if run in bg threads, may impact system
performance even more (see above).  Also, if one has a COWish DB then
why use more than one file?  If the answer were "to spread contents
across devices" one might ask "why not trust the filesystem/volume
manager to do that?", but hey.

I'm not actually proposing that people try to compose this ACI_
technique though...

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-30 22:22             ` Vladislav Bolkhovitin
@ 2012-10-31  9:54               ` Alan Cox
  2012-11-01 20:18                 ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 58+ messages in thread
From: Alan Cox @ 2012-10-31  9:54 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

> I don't want to flame on this topic, but you are not right here. As far as I can 
> see, a big chunk of Linux storage and file system developers are/were employed by 
> the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.
> 
> You know, RedHat from recent times also stepped to this market, at least I saw 
> their advertisement on SDC 2012. So, you can add here all RedHat employees.

Booleans generally should be reserved for logic operators. Most of the
Linux companies work on both low and high end storage. The two are not
mutually exclusive nor do they divide neatly by market. Many big clouds
use cheap low end drives by the crate, some high end desktops are using
SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
not sure personally there is much point

(and I used to have fibrechannel on my Thinkpad 600 when docked 8))

> Our discussion started not from "value-for-money", but from a constant demand to 
> perform ordered commands without full queue draining, which is ignored by the 
> Linux storage developers for YEARS as not useful, right?

Send patches with benchmarks demonstrating it is useful. It's really
quite simple. Code talks.

Alan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-31  9:54               ` Alan Cox
@ 2012-11-01 20:18                 ` Vladislav Bolkhovitin
  2012-11-01 21:24                   ` Alan Cox
  0 siblings, 1 reply; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-01 20:18 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh


Alan Cox, on 10/31/2012 05:54 AM wrote:
>> I don't want to flame on this topic, but you are not right here. As far as I can
>> see, a big chunk of Linux storage and file system developers are/were employed by
>> the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.
>>
>> You know, RedHat from recent times also stepped to this market, at least I saw
>> their advertisement on SDC 2012. So, you can add here all RedHat employees.
>
> Booleans generally should be reserved for logic operators. Most of the
> Linux companies work on both low and high end storage. The two are not
> mutually exclusive nor do they divide neatly by market. Many big clouds
> use cheap low end drives by the crate, some high end desktops are using
> SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
> not sure personally there is much point

That doesn't contradict the point that high performance storage vendors are also 
funding Linux kernel storage development.

> Send patches with benchmarks demonstrating it is useful. It's really
> quite simple. Code talks.

How about the fact that the preliminary infrastructure for sending ORDERED commands 
instead of draining the queue was recently deleted from the kernel, because "there's no 
difference where to drain the queue, on the kernel or the storage side"?

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-01 20:18                 ` Vladislav Bolkhovitin
@ 2012-11-01 21:24                   ` Alan Cox
  2012-11-02  0:15                     ` Vladislav Bolkhovitin
  2012-11-02  0:38                     ` Howard Chu
  0 siblings, 2 replies; 58+ messages in thread
From: Alan Cox @ 2012-11-01 21:24 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

> How about that recently preliminary infrastructure to send ORDERED commands 
> instead of queue draining was deleted from the kernel, because "there's no 
> difference where to drain the queue, on the kernel or the storage side"?

Send patches.

Alan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-01 21:24                   ` Alan Cox
@ 2012-11-02  0:15                     ` Vladislav Bolkhovitin
  2012-11-02  0:38                     ` Howard Chu
  1 sibling, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-02  0:15 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Ts'o, 杨苏立 Yang Su Li,
	linux-kernel, linux-fsdevel, drh


Alan Cox, on 11/01/2012 05:24 PM wrote:
>> How about that recently preliminary infrastructure to send ORDERED commands
>> instead of queue draining was deleted from the kernel, because "there's no
>> difference where to drain the queue, on the kernel or the storage side"?
>
> Send patches.

OK, then we have a good progress!

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-01 21:24                   ` Alan Cox
  2012-11-02  0:15                     ` Vladislav Bolkhovitin
@ 2012-11-02  0:38                     ` Howard Chu
  2012-11-02 12:33                       ` Alan Cox
                                         ` (2 more replies)
  1 sibling, 3 replies; 58+ messages in thread
From: Howard Chu @ 2012-11-02  0:38 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Alan Cox, Vladislav Bolkhovitin, Theodore Ts'o, drh,
	linux-kernel, linux-fsdevel

Alan Cox wrote:
>> How about that recently preliminary infrastructure to send ORDERED commands
>> instead of queue draining was deleted from the kernel, because "there's no
>> difference where to drain the queue, on the kernel or the storage side"?
>
> Send patches.

Isn't any type of kernel-side ordering an exercise in futility, since
   a) the kernel has no knowledge of the disk's actual geometry
   b) most drives will internally re-order requests anyway
   c) cheap drives won't support barriers

Even assuming the drives honored all your requests without lying, how would 
you really want this behavior exposed? From the userland perspective, there 
are very few apps that care. Probably only transactional databases, really.

As a DB author, I'm not sure I'd be keen on this as an open() or fcntl() 
option. Databases that really care would be on dedicated filesystems and/or 
devices, so per-file control would be tedious. You would most likely want to 
say "all writes to this string of devices should be order-preserving" and 
forget about it. With that guarantee, a careful writer can have perfectly 
intact data structures all the time, without ever slowing down for a fsync.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-02  0:38                     ` Howard Chu
@ 2012-11-02 12:33                       ` Alan Cox
  2012-11-13  3:41                         ` Vladislav Bolkhovitin
  2012-11-13  3:37                       ` Vladislav Bolkhovitin
       [not found]                       ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com>
  2 siblings, 1 reply; 58+ messages in thread
From: Alan Cox @ 2012-11-02 12:33 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin,
	Theodore Ts'o, drh, linux-kernel, linux-fsdevel

> Isn't any type of kernel-side ordering an exercise in futility, since
>    a) the kernel has no knowledge of the disk's actual geometry
>    b) most drives will internally re-order requests anyway

They will but only as permitted by the commands queued, so you have some
control depending upon the interface capabilities.

>    c) cheap drives won't support barriers

Barriers are pretty much universal as you need them for power off !

> Even assuming the drives honored all your requests without lying, how would 
> you really want this behavior exposed? From the userland perspective, there 
> are very few apps that care. Probably only transactional databases, really.

And file systems internally sometimes. A file system is after all a
transactional database of sorts.

Alan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25 18:29                         ` Theodore Ts'o
@ 2012-11-05 20:03                           ` Pavel Machek
  2012-11-05 22:04                             ` Theodore Ts'o
  0 siblings, 1 reply; 58+ messages in thread
From: Pavel Machek @ 2012-11-05 20:03 UTC (permalink / raw)
  To: Theodore Ts'o, david, Nico Williams,
	General Discussion of SQLite Database, 杨苏立 Yang Su Li,
	linux-fsdevel, linux-kernel, drh

On Thu 2012-10-25 14:29:48, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 11:03:13AM -0700, david@lang.hm wrote:
> > I agree, this is why I'm trying to figure out the recommended way to
> > do this without needing to do full commits.
> > 
> > Since in most cases it's acceptable to lose the last few chunks
> > written, if we had some way of specifying ordering, without having
> > to specify "write this NOW", the solution would be pretty obvious.
> 
> Well, using data journalling with ext3/4 may do what you want.  If you
> don't do any fsync, the changes will get written every 5 seconds when
> the automatic journal sync happens (and sub-4k writes will also get

Hmm. But that would need setting journalling mode per-file, no?

Like, make it journal data for all the databases, but keep normal mode
for rest of system...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-05 20:03                           ` Pavel Machek
@ 2012-11-05 22:04                             ` Theodore Ts'o
       [not found]                               ` <CALwJ=Mx-uEFLXK2wywekk=0dwrwVFb68wocnH9bjXJmHRsJx3w@mail.gmail.com>
  0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-11-05 22:04 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Mon, Nov 05, 2012 at 09:03:48PM +0100, Pavel Machek wrote:
> > Well, using data journalling with ext3/4 may do what you want.  If you
> > don't do any fsync, the changes will get written every 5 seconds when
> > the automatic journal sync happens (and sub-4k writes will also get
> 
> Hmm. But that would need setting journalling mode per-file, no?
> 
> Like, make it journal data for all the databases, but keep normal mode
> for rest of system...

You can do that, using "chattr +j file.db".  It's apparently not a
well known feature of ext3/4....
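
For what it's worth, the same flag can also be set from C via the
FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls (a rough sketch; it needs the same
privileges as chattr, and the path handling is illustrative only):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Sketch: the programmatic equivalent of "chattr +j <path>". */
static int set_journal_data(const char *path)
{
        int fd = open(path, O_RDONLY);
        int flags, ret = -1;

        if (fd < 0)
                return -1;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
                flags |= FS_JOURNAL_DATA_FL;
                ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
        }
        close(fd);
        return ret;
}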

						- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]                               ` <CALwJ=Mx-uEFLXK2wywekk=0dwrwVFb68wocnH9bjXJmHRsJx3w@mail.gmail.com>
@ 2012-11-05 23:00                                 ` Theodore Ts'o
  0 siblings, 0 replies; 58+ messages in thread
From: Theodore Ts'o @ 2012-11-05 23:00 UTC (permalink / raw)
  To: Richard Hipp
  Cc: General Discussion of SQLite Database, Pavel Machek, david,
	Nico Williams, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Mon, Nov 05, 2012 at 05:37:02PM -0500, Richard Hipp wrote:
> 
> Per the docs:  "Only the superuser or a process possessing the
> CAP_SYS_RESOURCE capability can set or clear this attribute."  That
> prevents most applications that run SQLite from being able to take
> advantage of this, since most such applications lack elevated privileges.

If this feature would prove useful to SQLite, that's something we
could address.  I could imagine making this available to processes that
belong to a specific group that would be specified in the superblock
or as a mount option.  (We already have something like that which
allows a specific uid or gid to use the reserved space in the
superblock.)

							- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-02  0:38                     ` Howard Chu
  2012-11-02 12:33                       ` Alan Cox
@ 2012-11-13  3:37                       ` Vladislav Bolkhovitin
       [not found]                       ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com>
  2 siblings, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13  3:37 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, Alan Cox,
	Vladislav Bolkhovitin, Theodore Ts'o, drh, linux-kernel,
	linux-fsdevel


Howard Chu, on 11/01/2012 08:38 PM wrote:
> Alan Cox wrote:
>>> How about that recently preliminary infrastructure to send ORDERED commands
>>> instead of queue draining was deleted from the kernel, because "there's no
>>> difference where to drain the queue, on the kernel or the storage side"?
>>
>> Send patches.
>
> Isn't any type of kernel-side ordering an exercise in futility, since
> a) the kernel has no knowledge of the disk's actual geometry
> b) most drives will internally re-order requests anyway
> c) cheap drives won't support barriers

This is why it is so important for performance to use all storage capabilities. 
In particular, ORDERED commands instead of trying to pretend to be smarter than the 
storage by doing queue draining.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-02 12:33                       ` Alan Cox
@ 2012-11-13  3:41                         ` Vladislav Bolkhovitin
  2012-11-13 17:40                           ` Alan Cox
  0 siblings, 1 reply; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13  3:41 UTC (permalink / raw)
  To: Alan Cox
  Cc: Howard Chu, General Discussion of SQLite Database,
	Vladislav Bolkhovitin, Theodore Ts'o, drh, linux-kernel,
	linux-fsdevel


Alan Cox, on 11/02/2012 08:33 AM wrote:
>>     b) most drives will internally re-order requests anyway
>
> They will but only as permitted by the commands queued, so you have some
> control depending upon the interface capabilities.
>
>>     c) cheap drives won't support barriers
>
> Barriers are pretty much universal as you need them for power off !

I'm afraid no storage (drives, if you like this term more) at the moment supports 
barriers and, as far as I know the storage history, never has.

Instead, what storage does support in this area are:

1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.

2. Commands ordering facilities: commands attributes (ORDERED, SIMPLE, etc.), ACA, 
etc.

3. Atomic commands, e.g. scattered writes, which allow writing data to several 
separate, non-adjacent blocks in an atomic manner, i.e. guarantee that either all 
blocks are written or none at all. This is relatively new functionality, natural 
for flash storage with its COW internals.

Obviously, using such atomic write commands, an application or a file system doesn't 
need any journaling anymore. FusionIO reported that after they modified MySQL to 
use them, they saw a 50% performance increase.


Note that those 3 facilities are ORTHOGONAL, i.e. they can be used independently, 
including on the same request. That is the root cause of why the barrier concept is 
so evil. If you specify a barrier, how can you say what actual action you really 
want from the storage: cache flush? Or ordered write? Or both?

This is why the relatively recent removal of barriers from the Linux kernel 
(http://lwn.net/Articles/400541/) was a big step ahead. The next logical step 
should be to allow the ORDERED attribute on requests to be accelerated by the 
ORDERED commands of the storage, if it supports them, and if not, to fall back 
to the existing queue draining.

Actually, I'm wondering why the barrier concept is so sticky in the Linux world. A 
simple Google search shows that only Linux uses this concept for storage. And 2 
years have passed since they were removed from the kernel, but people still discuss 
barriers as if they were still here.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]                       ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com>
@ 2012-11-13  3:41                         ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13  3:41 UTC (permalink / raw)
  To: Richard Hipp
  Cc: General Discussion of SQLite Database, Theodore Ts'o, drh,
	linux-kernel, linux-fsdevel, Alan Cox

Richard Hipp, on 11/02/2012 08:24 AM wrote:
> SQLite cares.  SQLite is an in-process, transaction, zero-configuration
> database that is estimated to be used by over 1 million distinct
> applications and to be have over 2 billion deployments.  SQLite uses
> ordinary disk files in ordinary directories, often selected by the
> end-user.  There is no system administrator with SQLite, so there is no
> opportunity to use a dedicated filesystem with special mount options.
>
> SQLite uses fsync() as a write barrier to assure consistency following a
> power loss.  In addition, we do everything we can to maximize the amount of
> time after the fsync() before we actually do another write where order
> matters, in the hopes that the writes will still be ordered on platforms
> where fsync() is ignored for whatever reason.  Even so, we believe we could
> get a significant performance boost and reliability improvement if we had a
> reliable write barrier.

I would suggest you forget the word "barrier" for productivity's sake. You don't want 
barriers and the confusion they bring. You want instead access to storage-accelerated 
cache sync, command ordering and atomic attributes/operations. See my other 
e-mail from today about those.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]           ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com>
@ 2012-11-13  3:42             ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13  3:42 UTC (permalink / raw)
  To: 杨苏立 Yang Su Li
  Cc: Theodore Ts'o, General Discussion of SQLite Database,
	linux-kernel, linux-fsdevel, Richard Hipp

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:
>>  SATA's Native Command
>>> Queuing (NCQ) is not equivalent; this allows the drive to reorder
>>> requests (in particular read requests) so they can be serviced more
>>> efficiently, but it does *not* allow the OS to specify a partial,
>>> relative ordering of requests.
>>>
>>
>> And so? If SATA can't do it, does it mean that nobody else can't do it
>> too? I know a plenty of non-SATA devices, which can do the ordering
>> requirements you need.
>>
>
> I would be very much interested in what kind of device support this kind of
> "topological order", and in what settings they are typically used.
>
> Does modern flash/SSD (esp. which are used on smartphones) support this?
>
> If you could point me to some information about this, that would be very
> much appreciated.

I don't think the storage in smartphones can support such advanced functionality, 
because it tends to be the cheapest, hence the simplest.

But many modern enterprise SAS drives can do it, because for those customers 
performance is the key requirement. Unfortunately, I'm not sure I can name exact 
brands and models, because I got my knowledge from NDA'ed docs, so this info may 
also be NDA'ed.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-13  3:41                         ` Vladislav Bolkhovitin
@ 2012-11-13 17:40                           ` Alan Cox
  2012-11-13 19:13                             ` Nico Williams
  2012-11-15  1:16                             ` Vladislav Bolkhovitin
  0 siblings, 2 replies; 58+ messages in thread
From: Alan Cox @ 2012-11-13 17:40 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Howard Chu, General Discussion of SQLite Database,
	Theodore Ts'o, drh, linux-kernel, linux-fsdevel

> > Barriers are pretty much universal as you need them for power off !
> 
> I'm afraid, no storage (drives, if you like this term more) at the moment supports 
> barriers and, as far as I know the storage history, has never supported.

The ATA cache flush is a write barrier, and given you have no NV cache
visible to the controller it's the same thing.

> Instead, what storage does support in this area are:

Yes - the devil is in the detail once you go beyond simple capabilities.

Alan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-13 17:40                           ` Alan Cox
@ 2012-11-13 19:13                             ` Nico Williams
  2012-11-15  1:17                               ` Vladislav Bolkhovitin
  2012-11-15  1:16                             ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 58+ messages in thread
From: Nico Williams @ 2012-11-13 19:13 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp,
	linux-kernel, linux-fsdevel

On Tue, Nov 13, 2012 at 11:40 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> > Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid, no storage (drives, if you like this term more) at the moment supports
>> barriers and, as far as I know the storage history, has never supported.
>
> The ATA cache flush is a write barrier, and given you have no NV cache
> visible to the controller it's the same thing.
>
>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.

Right: barriers are trivial to program with.  Ordered writes less so.
One could declare all writes to be ordered with respect to each other,
but this will almost certainly hurt performance (at least with disks,
though probably not SSDs) as opposed to barriers, which order one
group of internally-unordered writes relative to another.  And
declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.

There's a lot to be said for simplicity... as long as the system is
not so simple as to not work at all.

My p.o.v. is that a filesystem write barrier is effectively the same
as fsync() with the ability to return sooner (before writes hit stable
storage) when the filesystem and hardware support on-disk layouts and
primitives which can be used to order writes preceding and succeeding
the barrier.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-13 17:40                           ` Alan Cox
  2012-11-13 19:13                             ` Nico Williams
@ 2012-11-15  1:16                             ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-15  1:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Howard Chu, General Discussion of SQLite Database,
	Theodore Ts'o, drh, linux-kernel, linux-fsdevel


Alan Cox, on 11/13/2012 12:40 PM wrote:
>>> Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid, no storage (drives, if you like this term more) at the moment supports
>> barriers and, as far as I know the storage history, has never supported.
>
> The ATA cache flush is a write barrier, and given you have no NV cache
> visible to the controller it's the same thing.

A cache flush is a cache flush. You can call it a barrier if you want to continue 
confusing yourself and others.

>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.

None of those details brings anything unsolvable. For instance, I already 
described in this thread a simple way in which the requested order of commands can 
be carried through the stack, and implemented that algorithm in SCST.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-13 19:13                             ` Nico Williams
@ 2012-11-15  1:17                               ` Vladislav Bolkhovitin
  2012-11-15 12:07                                 ` David Lang
  2012-11-15 17:06                                 ` Ryan Johnson
  0 siblings, 2 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-15  1:17 UTC (permalink / raw)
  To: Nico Williams
  Cc: General Discussion of SQLite Database, Theodore Ts'o,
	Richard Hipp, linux-kernel, linux-fsdevel


Nico Williams, on 11/13/2012 02:13 PM wrote:
> declaring groups of internally-unordered writes where the groups are
> ordered with respect to each other... is practically the same as
> barriers.

Which barriers? Barriers meaning cache flush, or barriers meaning command order, 
or barriers meaning both?

There's no such thing as a "barrier". It is a fully artificial abstraction. After 
all, at the bottom of your stack you will have to translate it either to a cache 
flush, or to command order enforcement, or to both.

Are you going to invent 3 types of barriers?

> There's a lot to be said for simplicity... as long as the system is
> not so simple as to not work at all.
>
> My p.o.v. is that a filesystem write barrier is effectively the same
> as fsync() with the ability to return sooner (before writes hit stable
> storage) when the filesystem and hardware support on-disk layouts and
> primitives which can be used to order writes preceding and succeeding
> the barrier.

Your mistake is that you are considering barriers as something real, which can do 
something real for you, while they are just an artificial abstraction apparently 
invented by people with limited knowledge of how storage works, hence with a very 
foggy vision of how barriers are supposed to be processed by it. A simple wrong answer.

Generally, you can invent any abstraction convenient for you, but the farther your 
abstractions are from the reality of your hardware, the less you will get from it, 
and with bigger effort.

There are no barriers in Linux, and there are not going to be. Accept it. And start 
instead thinking about the offload capabilities your storage can offer you.

Vlad


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15  1:17                               ` Vladislav Bolkhovitin
@ 2012-11-15 12:07                                 ` David Lang
  2012-11-16 15:06                                   ` Howard Chu
                                                     ` (2 more replies)
  2012-11-15 17:06                                 ` Ryan Johnson
  1 sibling, 3 replies; 58+ messages in thread
From: David Lang @ 2012-11-15 12:07 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Nico Williams, General Discussion of SQLite Database,
	Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:

> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are
>> ordered with respect to each other... is practically the same as
>> barriers.
>
> Which barriers? Barriers meaning cache flush or barriers meaning commands 
> order, or barriers meaning both?
>
> There's no such thing as "barrier". It is fully artificial abstraction. After 
> all, at the bottom of your stack, you will have to translate it either to 
> cache flush, or commands order enforcement, or both.

When people talk about barriers, they are talking about order enforcement.

> Your mistake is that you are considering barriers as something real, which 
> can do something real for you, while it is just a artificial abstraction 
> apparently invented by people with limited knowledge how storage works, hence 
> having very foggy vision how barriers supposed to be processed by it. A 
> simple wrong answer.
>
> Generally, you can invent any abstraction convenient for you, but farther 
> your abstractions from reality of your hardware => less you will get from it 
> with bigger effort.
>
> There are no barriers in Linux and not going to be. Accept it. And start 
> instead thinking about offload capabilities your storage can offer to you.

the hardware capabilities are not directly accessible from userspace (and they 
probably shouldn't be)

barriers keep getting mentioned because they are an easy concept to understand. 
"do this set of stuff before doing any of this other set of stuff, but I don't 
care when any of this gets done" and they fit well with the requirements of the 
users.

Users readily accept that if the system crashes, they will lose the most recent 
stuff that they did, but they get annoyed when things get corrupted to the point 
that they lose the entire file.

this includes things like modifying one option and a crash resulting in the 
config file being blank. Yes, you can do the 'write to temp file, sync file, 
sync directory, rename file" dance, but the fact that to do so the user must sit 
and wait for the syncs to take place can be a problem. It would be far better to 
be able to say "write to temp file, and after it's on disk, rename the file" and 
not have the user wait. The user doesn't really care if the changes hit disk 
immediately, or several seconds (or even 10s of seconds) later, as long as there 
is not any possibility of the rename hitting disk before the file contents.
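
For reference, the dance in question looks roughly like this (a sketch with a
hypothetical config path, not taken from any particular application):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Sketch of "write temp file, fsync, rename, fsync directory" to replace
 * /path/to/app.conf atomically (paths are illustrative). */
static int replace_config(const char *data, size_t len)
{
        int fd = open("/path/to/app.conf.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                return -1;
        }
        close(fd);

        if (rename("/path/to/app.conf.tmp", "/path/to/app.conf") != 0)
                return -1;

        /* sync the directory so the rename itself is durable */
        int dirfd = open("/path/to", O_RDONLY | O_DIRECTORY);
        if (dirfd >= 0) {
                fsync(dirfd);
                close(dirfd);
        }
        return 0;
}

The complaint above is exactly that the two fsync() calls force the user to
sit and wait, even though all that is really needed here is ordering.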

The fact that this could be implemented in multiple ways in the existing 
hardware does not mean that there need to be multiple ways exposed to userspace, 
it just means that the cost of doing the operation will vary depending on the 
hardware that you have. This also means that if new hardware introduces a new 
way of implementing this, that improvement can be passed on to the users without 
needing application changes.

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15  1:17                               ` Vladislav Bolkhovitin
  2012-11-15 12:07                                 ` David Lang
@ 2012-11-15 17:06                                 ` Ryan Johnson
  2012-11-15 22:35                                   ` Chris Friesen
  1 sibling, 1 reply; 58+ messages in thread
From: Ryan Johnson @ 2012-11-15 17:06 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Vladislav Bolkhovitin, Nico Williams, linux-fsdevel,
	Theodore Ts'o, linux-kernel, Richard Hipp

On 14/11/2012 8:17 PM, Vladislav Bolkhovitin wrote:
> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are
>> ordered with respect to each other... is practically the same as
>> barriers.
>
> Which barriers? Barriers meaning cache flush or barriers meaning 
> commands order, or barriers meaning both?
>
> There's no such thing as "barrier". It is fully artificial 
> abstraction. After all, at the bottom of your stack, you will have to 
> translate it either to cache flush, or commands order enforcement, or 
> both.
Isn't that  why we *have* "the stack" in the first place? So apps 
*don't* have to worry about how the OS implements an artificial (= 
high-level and portable) abstraction on a given device?

>
> Are you going to invent 3 types of barriers?
One will do, it just needs to be a good one.

Maybe I'm missing something here, so I'm going to back up a bit and 
recap what I understand.

The filesystem abstracts the concept of encoding patterns of bits in 
some physical media (data), and making it easy to find and retrieve 
those bits later (metadata, incl. file name). When users read(), they 
expect to see whatever they most recently sent to write(). They also 
expect that what they write will still be there later,  in spite of any 
failure that leaves the disk itself intact.

Operating systems cheat by not actually writing to disk -- for 
performance reasons -- and users are (mostly, usually) OK with that, 
because the performance gains are so attractive and things usually work 
out anyway. Disks cheat too, in the same way and for the same reason.

The cheating works great most of the time, but breaks down -- badly -- 
if we actually care about what is on disk after a crash (or if we use a 
network filesystem). Enough people do care that fsync() was added to the 
toolbox. It is defined to transfer "all modified in-core data of the 
file referred to by the file descriptor fd to the disk device" and 
"blocks until the device reports that the transfer has completed" 
(quoting from the fsync(2) man page). Translation: "Stop cheating. Make 
sure the stuff I already wrote actually got written. And tell the disk 
to stop cheating, too."

Problem is, this definition is asymmetric: it says what happens to 
writes issued before the fsync, but nothing about those issued after the 
fsync starts and before it returns [1]. The reader has to assume 
fsync() makes no promises whatsoever about these later writes: making 
fsync capture them exposes callers of fsync() to DoS attacks, and preventing 
them from reaching disk until all outstanding fsync calls complete would add 
complexity the spec doesn't currently demand, leading to understandable 
reluctance by kernel devs to code it up. Unfortunately, we're left with 
the filesystem equivalent of what we in the database world call 
"eventual consistency" -- easy to implement, nice and fast, but very 
difficult to write reliable code against unless you're willing to pay 
the cost of being fully synchronous, all the time. Having tried that for 
a few years, many people are "returning" to better-specified concurrency 
models, trading some amount of performance for comfort that the app will 
at least work predictably when things go wrong in strange and 
unanticipated ways.

The request, then, is to tighten up fsync semantics in two conceptually 
straightforward ways [2]: First, guarantee that later writes to an fd do 
not hit disk until earlier calls to fsync() complete. Second, make the 
call asynchronous. That's all.
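
To make the intended usage concrete, a hypothetical sketch (fsync_async()
does not exist; it merely stands in for the semantics requested above):

#include <unistd.h>
#include <sys/types.h>

/* Hypothetical call: returns once the set of dirty pages to write back has
 * been identified; later writes to the same file are not sent to the device
 * until that writeback (and cache flush) has completed. */
extern int fsync_async(int fd);

/* A log-structured DB could then order txn N's commit record ahead of any
 * page of txn N+1 without ever blocking on the disk. */
static void commit_record(int db_fd, const void *rec, size_t len, off_t off)
{
        if (pwrite(db_fd, rec, len, off) == (ssize_t)len)
                fsync_async(db_fd);     /* ordering point, returns immediately */
        /* ... the caller may begin writing txn N+1 right away ... */
}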

Note that both changes are necessary. The improved ordering semantic is 
useless by itself, because it's still not safe to request a blocking 
fsync from one thread and then let other threads continue issuing 
writes: there's a race between broadcasting that fsync has begun and 
issuing the actual syscall that begins it. An asynchronous fsync is also 
useless by itself, because it only benefits uncoordinated writes (which 
evidently don't care what data actually reaches disk anyway).

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to 
the affected file, wait for the device to report success, issue a cache 
flush to the device (or request ordering commands, if available) to make 
it tell the truth, and wait for the device to report success. AFAIK this 
already happens, but without taking advantage of any request ordering 
commands.
2. The requesting thread returns as soon as the kernel has identified 
all data that will be written back. This is new, but pretty similar to 
what AIO already does.
3. No write is allowed to enqueue any requests at the device that 
involve the same file, until all outstanding fsync complete [3]. This is 
new.

The performance hit for #1 can be reduced significantly if the storage 
hardware at hand happens to support some form of request ordering. The 
amount of reduction could vary greatly depending on how sophisticated 
such request ordering is, and how much effort the kernel and/or device 
driver are willing to work for it. In any case, fsync should already do 
this [4].

The performance hit for #3 can be minimized by buffering small or 
otherwise convenient writes in the fs cache and letting the call return 
immediately, as usual. The corresponding pages just have to be marked in 
some way to prevent them from being written back too soon. Sequence 
numbers work well for this sort of thing. Big requests may have to 
block, but they probably would have anyway, if the buffer cache couldn't 
absorb them. As with #1, fancy command ordering capabilities in the 
underlying device just allow additional performance optimizations.

A carefully-written app (e.g. free of I/O races) would do pretty well 
with this extended fsync, certainly far better than the current state of 
the art allows.

Note that this still offers no protection for reads: no matter how many 
times a thread issues fsync(), it still risks reading non-durable data 
because reads are not ordered wrt either writes or fsync. That's not the 
problem we're trying to solve, though.

Please feel free to point out where I've gone wrong, but this just 
doesn't look like as complex or crazy an idea as you make it out to be.

[1] Maybe POSIX.1-2001 is more specific, but it's not publicly available 
that I could see.

[2] I'm fully aware that implementing the request might require 
significant -- perhaps even unreasonably complex -- changes to the way 
the kernel currently does things (though I do doubt it). That's not a 
good excuse to claim the idea itself is unreasonably complex or 
ill-specified. Just say that it's not a good fit for the current code base.

[3]  Another concern is whether fsync calls operate on the file or a 
particular fd. What if a process opens the same file multiple times, or 
multiple processes have fds pointing to the same file (whether by open 
or fork)? I would argue for file-level barriers, because it leads to a 
vastly simpler design (the fs cache doesn't track which process wrote 
what via what fd). Besides, no app that cares about what ends up on disk 
will allow uncoordinated writes anyway, so why do extra work just to 
ensure I/O races stay fast?

[4] Really, device support for request ordering commands is a bit of a 
red herring: the only way it helps significantly is if (a) the storage 
device has a massive cache compared to the fs cache, (b) it allows I/O 
scheduling to reduce latency of reads and/or writes (which fsync should 
do already, and which matters little for flash), and (c) a logging 
filesystem is not being used (else it's all sequential writes anyway). 
In other words, it can help performance a bit but has little other 
impact on what is essentially a software matter.

>
>> There's a lot to be said for simplicity... as long as the system is
>> not so simple as to not work at all.
>>
>> My p.o.v. is that a filesystem write barrier is effectively the same
>> as fsync() with the ability to return sooner (before writes hit stable
>> storage) when the filesystem and hardware support on-disk layouts and
>> primitives which can be used to order writes preceding and succeeding
>> the barrier.
>
> Your mistake is that you are considering barriers as something real, 
> which can do something real for you, while it is just a artificial 
> abstraction apparently invented by people with limited knowledge how 
> storage works, hence having very foggy vision how barriers supposed to 
> be processed by it. A simple wrong answer.
Storage: Accepts writes and ostensibly makes them available via reads 
even after power failures. Reorders requests nearly arbitrarily and lies 
about whether writes actually took effect, unless you issue appropriate 
cache flushing and/or request ordering commands (and sometimes even 
then, if it was a cheap consumer drive).

OS: Accepts writes and ostensibly makes them available via reads even 
after power failures, reboots, etc. Reorders requests nearly arbitrarily 
and lies about whether writes actually took effect, unless you issue a 
stop-the-world, one-sided write barrier lovingly known as fsync 
(assuming the disk actually listens when you tell it to stop cheating).

Wish: a two-sided write barrier that not only ensures previously-issued 
writes complete before it reports success, but also prevents 
later-issued writes from completing while it is in progress, giving a 
reasonably simple way to enforce some ordering of writes in the system. 
Can be implemented entirely in software, as the latter has full control 
over which requests it chooses to schedule at the device, and also 
decides whether to block the requesting thread or not. Can be made 
virtually as fast as current writes, by maintaining a little extra 
information in the fs cache.
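
For concreteness, a sketch of how such a primitive might look to an 
application; fbarrier() is purely hypothetical and does not exist today:

/* Hypothetical usage sketch: fbarrier() does not exist in Linux.
 * Intent: W1 must reach the media before W2, but neither call needs to
 * block until the data is durable. */
#include <unistd.h>

int fbarrier(int fd);   /* hypothetical two-sided write barrier */

int write_journal_then_commit(int fd,
                              const void *rec, size_t rec_len, off_t rec_off,
                              const void *commit, size_t commit_len,
                              off_t commit_off)
{
    if (pwrite(fd, rec, rec_len, rec_off) < 0)           /* W1 */
        return -1;
    if (fbarrier(fd) < 0)                                /* order point */
        return -1;
    if (pwrite(fd, commit, commit_len, commit_off) < 0)  /* W2, after W1 */
        return -1;
    return 0;
}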

Please, enlighten me: in what way does my limited knowledge of storage, 
or my foggy vision of what is desired, make this feature impossible to 
implement or useless if implemented?

>
> Generally, you can invent any abstraction convenient for you, but 
> farther your abstractions from reality of your hardware => less you 
> will get from it with bigger effort.
>
> There are no barriers in Linux and not going to be. Accept it. And 
> start instead thinking about offload capabilities your storage can 
> offer to you.
Apologies if this comes off as flame-bait, but I start to wonder whose 
abstraction is broken here...

What I understand the above to mean is: "Linux file system abstractions 
are too far from the reality of storage hardware, so it takes lots of 
effort to accomplish little [in the way of enforcing write ordering]. 
Accept it. And start thinking instead about talking directly to a 
storage controller that offers proper write barriers."

I hope I misread what you said, because that's a depressing thing to 
hear from your OS.

Ryan


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15 17:06                                 ` Ryan Johnson
@ 2012-11-15 22:35                                   ` Chris Friesen
  2012-11-17  5:02                                     ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Friesen @ 2012-11-15 22:35 UTC (permalink / raw)
  To: Ryan Johnson
  Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin,
	Nico Williams, linux-fsdevel, Theodore Ts'o, linux-kernel,
	Richard Hipp

On 11/15/2012 11:06 AM, Ryan Johnson wrote:

> The easiest way to implement this fsync would involve three things:
> 1. Schedule writes for all dirty pages in the fs cache that belong to
> the affected file, wait for the device to report success, issue a cache
> flush to the device (or request ordering commands, if available) to make
> it tell the truth, and wait for the device to report success. AFAIK this
> already happens, but without taking advantage of any request ordering
> commands.
> 2. The requesting thread returns as soon as the kernel has identified
> all data that will be written back. This is new, but pretty similar to
> what AIO already does.
> 3. No write is allowed to enqueue any requests at the device that
> involve the same file, until all outstanding fsync complete [3]. This is
> new.

This sounds interesting as a way to expose some useful semantics to 
userspace.

I assume we'd need to come up with a new syscall or something since it 
doesn't match the behaviour of posix fsync().

Chris

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15 12:07                                 ` David Lang
@ 2012-11-16 15:06                                   ` Howard Chu
  2012-11-16 15:31                                     ` Ric Wheeler
  2012-11-16 19:14                                     ` David Lang
  2012-11-17  5:02                                   ` Vladislav Bolkhovitin
       [not found]                                   ` <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com>
  2 siblings, 2 replies; 58+ messages in thread
From: Howard Chu @ 2012-11-16 15:06 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: David Lang, Vladislav Bolkhovitin, Theodore Ts'o,
	Richard Hipp, linux-kernel, linux-fsdevel

David Lang wrote:
> barriers keep getting mentioned because they are a easy concept to understand.
> "do this set of stuff before doing any of this other set of stuff, but I don't
> care when any of this gets done" and they fit well with the requirements of the
> users.
>
> Users readily accept that if the system crashes, they will loose the most recent
> stuff that they did,

*some* users may accept that. *None* should.

> but they get annoyed when things get corrupted to the point
> that they loose the entire file.
>
> this includes things like modifying one option and a crash resulting in the
> config file being blank. Yes, you can do the 'write to temp file, sync file,
> sync directory, rename file" dance, but the fact that to do so the user must sit
> and wait for the syncs to take place can be a problem. It would be far better to
> be able to say "write to temp file, and after it's on disk, rename the file" and
> not have the user wait. The user doesn't really care if the changes hit disk
> immediately, or several seconds (or even 10s of seconds) later, as long as there
> is not any possibility of the rename hitting disk before the file contents.
>
> The fact that this could be implemented in multiple ways in the existing
> hardware does not mean that there need to be multiple ways exposed to userspace,
> it just means that the cost of doing the operation will vary depending on the
> hardware that you have. This also means that if new hardware introduces a new
> way of implementing this, that improvement can be passed on to the users without
> needing application changes.

There are a couple industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it 
because they don't know better. We programmers, who know better, have failed 
to raise a stink and demand that this be fixed.
   A) Drives should not lose data on power failure. If a drive accepts a write 
request and says "OK, done" then that data should get written to stable 
storage, period. Whether it requires capacitors or some other onboard power 
supply, or whatever, they should just do it. Keep in mind that today, most of 
the difference between enterprise drives and consumer desktop drives is just a 
firmware change, that hardware is already identical. Nobody should accept a 
product that doesn't offer this guarantee. It's inexcusable.
   B) it should go without saying - drives should reliably report back to the 
host, when something goes wrong. E.g., if a write request has been accepted, 
cached, and reported complete, but then during the actual write an ECC failure 
is detected in the cacheline, the drive needs to tell the host "oh by the way, 
block XXX didn't actually make it to disk like I told you it did 10ms ago."

If the entire software industry were to simply state "your shit stinks and 
we're not going to take it any more" the hard drive industry would have no 
choice but to fix it. And in most cases it would be a zero-cost fix for them.

Once you have drives that are actually trustworthy, actually reliable (which 
doesn't mean they never fail, it only means they tell the truth about 
successes or failures), most of these other issues disappear. Most of the need 
for barriers disappear.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-16 15:06                                   ` Howard Chu
@ 2012-11-16 15:31                                     ` Ric Wheeler
  2012-11-16 15:54                                       ` Howard Chu
  2012-11-16 19:14                                     ` David Lang
  1 sibling, 1 reply; 58+ messages in thread
From: Ric Wheeler @ 2012-11-16 15:31 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, David Lang,
	Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp,
	linux-kernel, linux-fsdevel

On 11/16/2012 10:06 AM, Howard Chu wrote:
> David Lang wrote:
>> barriers keep getting mentioned because they are a easy concept to understand.
>> "do this set of stuff before doing any of this other set of stuff, but I don't
>> care when any of this gets done" and they fit well with the requirements of the
>> users.
>>
>> Users readily accept that if the system crashes, they will loose the most recent
>> stuff that they did,
>
> *some* users may accept that. *None* should.
>
>> but they get annoyed when things get corrupted to the point
>> that they loose the entire file.
>>
>> this includes things like modifying one option and a crash resulting in the
>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>> sync directory, rename file" dance, but the fact that to do so the user must sit
>> and wait for the syncs to take place can be a problem. It would be far better to
>> be able to say "write to temp file, and after it's on disk, rename the file" and
>> not have the user wait. The user doesn't really care if the changes hit disk
>> immediately, or several seconds (or even 10s of seconds) later, as long as there
>> is not any possibility of the rename hitting disk before the file contents.
>>
>> The fact that this could be implemented in multiple ways in the existing
>> hardware does not mean that there need to be multiple ways exposed to userspace,
>> it just means that the cost of doing the operation will vary depending on the
>> hardware that you have. This also means that if new hardware introduces a new
>> way of implementing this, that improvement can be passed on to the users without
>> needing application changes.
>
> There are a couple industry failures here:
>
> 1) the drive manufacturers sell drives that lie, and consumers accept it 
> because they don't know better. We programmers, who know better, have failed 
> to raise a stink and demand that this be fixed.
>   A) Drives should not lose data on power failure. If a drive accepts a write 
> request and says "OK, done" then that data should get written to stable 
> storage, period. Whether it requires capacitors or some other onboard power 
> supply, or whatever, they should just do it. Keep in mind that today, most of 
> the difference between enterprise drives and consumer desktop drives is just a 
> firmware change, that hardware is already identical. Nobody should accept a 
> product that doesn't offer this guarantee. It's inexcusable.
>   B) it should go without saying - drives should reliably report back to the 
> host, when something goes wrong. E.g., if a write request has been accepted, 
> cached, and reported complete, but then during the actual write an ECC failure 
> is detected in the cacheline, the drive needs to tell the host "oh by the way, 
> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>
> If the entire software industry were to simply state "your shit stinks and 
> we're not going to take it any more" the hard drive industry would have no 
> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>
> Once you have drives that are actually trustworthy, actually reliable (which 
> doesn't mean they never fail, it only means they tell the truth about 
> successes or failures), most of these other issues disappear. Most of the need 
> for barriers disappear.
>

I think that you are arguing a fairly silly point.

If you want that behaviour, you have had it for more than a decade - simply 
disable the write cache on your drive and you are done.

If you - as a user - want to run faster and use applications that are coded to 
handle data integrity properly (fsync, fdatasync, etc), leave the write cache 
enabled and use file system barriers.

Everyone has to trade off cost versus something else and this is a very, very 
long standing trade off that drive manufacturers have made.

The more money you pay for your storage, the less likely this is to be an issue 
(high end SSD's, enterprise class arrays, etc don't have volatile write caches 
and most SAS drives perform reasonably well with the write cache disabled).

Regards,

Ric



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-16 15:31                                     ` Ric Wheeler
@ 2012-11-16 15:54                                       ` Howard Chu
  2012-11-16 18:03                                         ` Ric Wheeler
  0 siblings, 1 reply; 58+ messages in thread
From: Howard Chu @ 2012-11-16 15:54 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: General Discussion of SQLite Database, David Lang,
	Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp,
	linux-kernel, linux-fsdevel

Ric Wheeler wrote:
> On 11/16/2012 10:06 AM, Howard Chu wrote:
>> David Lang wrote:
>>> barriers keep getting mentioned because they are a easy concept to understand.
>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>> care when any of this gets done" and they fit well with the requirements of the
>>> users.
>>>
>>> Users readily accept that if the system crashes, they will loose the most recent
>>> stuff that they did,
>>
>> *some* users may accept that. *None* should.
>>
>>> but they get annoyed when things get corrupted to the point
>>> that they loose the entire file.
>>>
>>> this includes things like modifying one option and a crash resulting in the
>>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>>> sync directory, rename file" dance, but the fact that to do so the user must sit
>>> and wait for the syncs to take place can be a problem. It would be far better to
>>> be able to say "write to temp file, and after it's on disk, rename the file" and
>>> not have the user wait. The user doesn't really care if the changes hit disk
>>> immediately, or several seconds (or even 10s of seconds) later, as long as there
>>> is not any possibility of the rename hitting disk before the file contents.
>>>
>>> The fact that this could be implemented in multiple ways in the existing
>>> hardware does not mean that there need to be multiple ways exposed to userspace,
>>> it just means that the cost of doing the operation will vary depending on the
>>> hardware that you have. This also means that if new hardware introduces a new
>>> way of implementing this, that improvement can be passed on to the users without
>>> needing application changes.
>>
>> There are a couple industry failures here:
>>
>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>> because they don't know better. We programmers, who know better, have failed
>> to raise a stink and demand that this be fixed.
>>    A) Drives should not lose data on power failure. If a drive accepts a write
>> request and says "OK, done" then that data should get written to stable
>> storage, period. Whether it requires capacitors or some other onboard power
>> supply, or whatever, they should just do it. Keep in mind that today, most of
>> the difference between enterprise drives and consumer desktop drives is just a
>> firmware change, that hardware is already identical. Nobody should accept a
>> product that doesn't offer this guarantee. It's inexcusable.
>>    B) it should go without saying - drives should reliably report back to the
>> host, when something goes wrong. E.g., if a write request has been accepted,
>> cached, and reported complete, but then during the actual write an ECC failure
>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>
>> If the entire software industry were to simply state "your shit stinks and
>> we're not going to take it any more" the hard drive industry would have no
>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>
>> Once you have drives that are actually trustworthy, actually reliable (which
>> doesn't mean they never fail, it only means they tell the truth about
>> successes or failures), most of these other issues disappear. Most of the need
>> for barriers disappear.
>>
>
> I think that you are arguing a fairly silly point.

Seems to me that you're arguing that we should accept inferior technology. 
Who's really being silly?

> If you want that behaviour, you have had it for more than a decade - simply
> disable the write cache on your drive and you are done.

You seem to believe it's nonsensical for someone to want both fast and 
reliable writes, or that it's unreasonable for a storage device to offer the 
same, cheaply. And yet it is clearly trivial to provide all of the above.

> If you - as a user - want to run faster and use applications that are coded to
> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
> enabled and use file system barriers.

Applications aren't supposed to need to worry about such details, that's why 
we have operating systems.

Drives should tell the truth. In event of an error detected after the fact, 
the drive should report the error back to the host. There's nothing 
nonsensical there.

When a drive's cache is enabled, the host should maintain a queue of written 
pages, of a length equal to the size of the drive's cache. If a drive says 
"hey, block XXX failed" the OS can reissue the write from its own queue. No 
muss, no fuss, no performance bottlenecks. This is what Real Computers did 
before the age of VAX Unix.
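
For illustration only, a rough sketch of such a host-side replay queue, 
assuming a (currently nonexistent) interface by which the drive can report 
a lost block after having acknowledged it:

/* Sketch only: assumes a drive/driver interface that can report, after
 * the fact, that a previously acknowledged block failed. No such
 * interface is standard today; all names are made up. */
#include <stdint.h>
#include <string.h>

#define DRIVE_CACHE_BLOCKS 4096   /* assumed size of the drive's write cache */
#define BLOCK_SIZE 4096

struct inflight {
    int valid;
    uint64_t lba;
    unsigned char data[BLOCK_SIZE];
};

static struct inflight ring[DRIVE_CACHE_BLOCKS];
static unsigned head;   /* next slot to reuse */

/* Remember every write until the drive can no longer be caching it. */
static void remember_write(uint64_t lba, const void *data)
{
    ring[head].valid = 1;
    ring[head].lba = lba;
    memcpy(ring[head].data, data, BLOCK_SIZE);
    head = (head + 1) % DRIVE_CACHE_BLOCKS;
}

/* Hypothetical error callback: the drive says block `lba` was lost. */
static void on_drive_reports_lost_block(uint64_t lba,
                                        int (*reissue)(uint64_t, const void *))
{
    for (unsigned i = 0; i < DRIVE_CACHE_BLOCKS; i++)
        if (ring[i].valid && ring[i].lba == lba)
            reissue(lba, ring[i].data);   /* replay from the host-side copy */
}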

> Everyone has to trade off cost versus something else and this is a very, very
> long standing trade off that drive manufacturers have made.

With the cost of storage falling as rapidly as it has in recent years, this is 
a stupid tradeoff.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-16 15:54                                       ` Howard Chu
@ 2012-11-16 18:03                                         ` Ric Wheeler
  0 siblings, 0 replies; 58+ messages in thread
From: Ric Wheeler @ 2012-11-16 18:03 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, David Lang,
	Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp,
	linux-kernel, linux-fsdevel

On 11/16/2012 10:54 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 11/16/2012 10:06 AM, Howard Chu wrote:
>>> David Lang wrote:
>>>> barriers keep getting mentioned because they are a easy concept to understand.
>>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>>> care when any of this gets done" and they fit well with the requirements of 
>>>> the
>>>> users.
>>>>
>>>> Users readily accept that if the system crashes, they will loose the most 
>>>> recent
>>>> stuff that they did,
>>>
>>> *some* users may accept that. *None* should.
>>>
>>>> but they get annoyed when things get corrupted to the point
>>>> that they loose the entire file.
>>>>
>>>> this includes things like modifying one option and a crash resulting in the
>>>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>>>> sync directory, rename file" dance, but the fact that to do so the user 
>>>> must sit
>>>> and wait for the syncs to take place can be a problem. It would be far 
>>>> better to
>>>> be able to say "write to temp file, and after it's on disk, rename the 
>>>> file" and
>>>> not have the user wait. The user doesn't really care if the changes hit disk
>>>> immediately, or several seconds (or even 10s of seconds) later, as long as 
>>>> there
>>>> is not any possibility of the rename hitting disk before the file contents.
>>>>
>>>> The fact that this could be implemented in multiple ways in the existing
>>>> hardware does not mean that there need to be multiple ways exposed to 
>>>> userspace,
>>>> it just means that the cost of doing the operation will vary depending on the
>>>> hardware that you have. This also means that if new hardware introduces a new
>>>> way of implementing this, that improvement can be passed on to the users 
>>>> without
>>>> needing application changes.
>>>
>>> There are a couple industry failures here:
>>>
>>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>>> because they don't know better. We programmers, who know better, have failed
>>> to raise a stink and demand that this be fixed.
>>>    A) Drives should not lose data on power failure. If a drive accepts a write
>>> request and says "OK, done" then that data should get written to stable
>>> storage, period. Whether it requires capacitors or some other onboard power
>>> supply, or whatever, they should just do it. Keep in mind that today, most of
>>> the difference between enterprise drives and consumer desktop drives is just a
>>> firmware change, that hardware is already identical. Nobody should accept a
>>> product that doesn't offer this guarantee. It's inexcusable.
>>>    B) it should go without saying - drives should reliably report back to the
>>> host, when something goes wrong. E.g., if a write request has been accepted,
>>> cached, and reported complete, but then during the actual write an ECC failure
>>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>>
>>> If the entire software industry were to simply state "your shit stinks and
>>> we're not going to take it any more" the hard drive industry would have no
>>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>>
>>> Once you have drives that are actually trustworthy, actually reliable (which
>>> doesn't mean they never fail, it only means they tell the truth about
>>> successes or failures), most of these other issues disappear. Most of the need
>>> for barriers disappear.
>>>
>>
>> I think that you are arguing a fairly silly point.
>
> Seems to me that you're arguing that we should accept inferior technology. 
> Who's really being silly?

No, just suggesting that you either pay for the expensive stuff or learn how to 
use cost effective, high capacity storage like the rest of the world.

I don't disagree that having non-volatile write caches would be nice, but 
everyone has learned how to deal with volatile write caches at the low end of 
the market.

>
>> If you want that behaviour, you have had it for more than a decade - simply
>> disable the write cache on your drive and you are done.
>
> You seem to believe it's nonsensical for someone to want both fast and 
> reliable writes, or that it's unreasonable for a storage device to offer the 
> same, cheaply. And yet it is clearly trivial to provide all of the above.

I look forward to seeing your products in the market.

Until you have more than "I want" and "I think" on your storage system design 
resume, I suggest you spend the money to get the parts with non-volatile write 
caches or fix your code.

Ric


>> If you - as a user - want to run faster and use applications that are coded to
>> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
>> enabled and use file system barriers.
>
> Applications aren't supposed to need to worry about such details, that's why 
> we have operating systems.
>
> Drives should tell the truth. In event of an error detected after the fact, 
> the drive should report the error back to the host. There's nothing 
> nonsensical there.
>
> When a drive's cache is enabled, the host should maintain a queue of written 
> pages, of a length equal to the size of the drive's cache. If a drive says 
> "hey, block XXX failed" the OS can reissue the write from its own queue. No 
> muss, no fuss, no performance bottlenecks. This is what Real Computers did 
> before the age of VAX Unix.
>
>> Everyone has to trade off cost versus something else and this is a very, very
>> long standing trade off that drive manufacturers have made.
>
> With the cost of storage falling as rapidly as it has in recent years, this is 
> a stupid tradeoff.
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-16 15:06                                   ` Howard Chu
  2012-11-16 15:31                                     ` Ric Wheeler
@ 2012-11-16 19:14                                     ` David Lang
  1 sibling, 0 replies; 58+ messages in thread
From: David Lang @ 2012-11-16 19:14 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin,
	Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

On Fri, 16 Nov 2012, Howard Chu wrote:

> David Lang wrote:
>> barriers keep getting mentioned because they are a easy concept to 
>> understand.
>> "do this set of stuff before doing any of this other set of stuff, but I 
>> don't
>> care when any of this gets done" and they fit well with the requirements of 
>> the
>> users.
>> 
>> Users readily accept that if the system crashes, they will loose the most 
>> recent
>> stuff that they did,
>
> *some* users may accept that. *None* should.

When users are given a choice between having all their work be very slow, or 
having it be fast but, in the unlikely event of a crash, losing their most 
recent changes, they are willing to lose their most recent changes.

If you think about it, this is not much different from the fact that you lose 
all changes since the last time you saved the thing you are working on. Many 
programs save state periodically so that if the application crashes the user 
hasn't lost everything, but any application that tried to save after every 
single change would be so slow that nobody would use it.

There is always going to be a window after a user hits 'save' where the data can 
be lost, because it's not yet on disk.

> There are a couple industry failures here:
>
> 1) the drive manufacturers sell drives that lie, and consumers accept it 
> because they don't know better. We programmers, who know better, have failed 
> to raise a stink and demand that this be fixed.
>  A) Drives should not lose data on power failure. If a drive accepts a write 
> request and says "OK, done" then that data should get written to stable 
> storage, period. Whether it requires capacitors or some other onboard power 
> supply, or whatever, they should just do it. Keep in mind that today, most of 
> the difference between enterprise drives and consumer desktop drives is just 
> a firmware change, that hardware is already identical. Nobody should accept a 
> product that doesn't offer this guarantee. It's inexcusable.

This option is available to you. However, if you have enabled write caching and 
reordering, you have explicitly told the system to be faster at the expense of 
losing data under some conditions. The fact that you then lose data under 
those conditions should not surprise you.

The idea that you must have enough power to write all the pending data to disk 
is problematic as that then severely limits the amount of cache that you have.

>  B) it should go without saying - drives should reliably report back to the 
> host, when something goes wrong. E.g., if a write request has been accepted, 
> cached, and reported complete, but then during the actual write an ECC 
> failure is detected in the cacheline, the drive needs to tell the host "oh by 
> the way, block XXX didn't actually make it to disk like I told you it did 
> 10ms ago."

The issue isn't a drive having a write error; it's the system shutting down 
(or crashing) before the data is written. No OS-level tricks will help you here.


The real problem here isn't the drive claiming the data has been written when it 
hasn't; the real problem is that the application has said 'write this data' to 
the OS, and the OS has not done so yet.

The OS delays the writes for many legitimate reasons (the disk may be busy, it 
can get things done more efficiently by combining and reordering the writes, etc.).

Unless the system crashes, this is not a problem: the data will eventually be 
written out, and on system shutdown everything is good.

But if the system crashes, some of this postponed work doesn't get done, and 
that can be a problem.

Applications can do fsync if they want to be sure that their data is safe on 
disk NOW, but they currently have no way of saying "I want to make sure that A 
happens before B, but I don't care if A happens now or 10 seconds from now"

That is the gap that it would be useful to provide a mechanism for, and it 
doesn't matter whether your disk system lies or not; there still isn't a way to 
deal with this today.
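
As an illustration, the only portable way to get that ordering today is the 
"write to temp file, sync file, sync directory, rename file" dance mentioned 
above, which forces the application to wait on every sync:

/* The current state of the art: write-new, fsync, rename, fsync-dir.
 * Ordering is guaranteed, but the caller must wait for both fsyncs. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int replace_config(const char *dir, const char *path, const char *tmp,
                   const char *contents)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, contents, strlen(contents)) < 0 ||
        fsync(fd) < 0) {                       /* contents durable first */
        close(fd);
        return -1;
    }
    close(fd);
    if (rename(tmp, path) < 0)                 /* atomic replace */
        return -1;
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);                       /* make the rename durable too */
    close(dfd);
    return rc;
}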

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15 12:07                                 ` David Lang
  2012-11-16 15:06                                   ` Howard Chu
@ 2012-11-17  5:02                                   ` Vladislav Bolkhovitin
       [not found]                                   ` <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com>
  2 siblings, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-17  5:02 UTC (permalink / raw)
  To: David Lang
  Cc: Nico Williams, General Discussion of SQLite Database,
	Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

David Lang, on 11/15/2012 07:07 AM wrote:
>> There's no such thing as "barrier". It is fully artificial abstraction. After
>> all, at the bottom of your stack, you will have to translate it either to cache
>> flush, or commands order enforcement, or both.
>
> When people talk about barriers, they are talking about order enforcement.

Not correct. When people talk about barriers, they mean different things. For 
instance, Alan Cox, a few e-mails ago, meant a cache flush.

That's the problem with the barriers concept: barriers are ambiguous. There's no 
barrier which can fit all requirements.

> the hardware capabilities are not directly accessable from userspace (and they
> probably shouldn't be)

The discussion is not about directly exposing storage hardware capabilities to 
user space. The discussion is about replacing the fully inadequate barrier 
abstraction with a set of other, adequate abstractions.

For instance:

1. Cache flush primitives:

1.1. FUA

1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile 
media

1.3. Immediate cache flush, i.e. return ASAP after the cache sync started, 
possibly before all data hit non-volatile media.

2. ORDERED attribute for requests. It provides the following behavior rules:

A.  All requests without this attribute can be executed in parallel and be freely 
reordered.

B. No ORDERED command can be completed before any previous not-ORDERED or ORDERED 
command has completed.

Those abstractions can naturally fit all storage capabilities. For instance:

  - On simple write-through (WT) cache hardware not supporting ordering commands, 
(1) translates to a NOP and (2) to queue draining.

  - On fully featured hardware, both (1) and (2) translate to the appropriate 
storage capabilities.

On FTL storage, (B) can be further optimized by doing data transfers for ORDERED 
commands in parallel, but committing them in the requested order.
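
For illustration only, a rough sketch of how rule (B) could be enforced at 
dispatch time on hardware with no native ordering support; REQ_ORDERED here 
is a hypothetical flag, not an existing block-layer one:

/* Sketch of rule (B) only; REQ_ORDERED is a hypothetical per-request
 * attribute. */
#include <stdbool.h>

#define REQ_ORDERED (1u << 0)

struct request {
    unsigned flags;
    bool completed;
    struct request *next;   /* older requests already issued to the device */
};

/* On hardware with no native ordering support, rule (B) degenerates to
 * queue draining: hold an ORDERED request back until every earlier
 * request has completed; everything else may be dispatched freely. */
static bool may_dispatch(const struct request *rq,
                         const struct request *older)
{
    if (!(rq->flags & REQ_ORDERED))
        return true;
    for (const struct request *r = older; r; r = r->next)
        if (!r->completed)
            return false;
    return true;
}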

> barriers keep getting mentioned because they are a easy concept to understand.

Well, the concept of a flat Earth with the Sun rotating around it is also easy to 
understand. So why isn't it used?

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]                                   ` <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com>
@ 2012-11-17  5:02                                     ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-17  5:02 UTC (permalink / raw)
  To: 杨苏立 Yang Su Li
  Cc: General Discussion of SQLite Database, Theodore Ts'o,
	Richard Hipp, linux-kernel, linux-fsdevel

杨苏立 Yang Su Li, on 11/15/2012 11:14 AM wrote:
> 1. fsync actually does two things at the same time: ordering writes (in a
> barrier-like manner), and forcing cached writes to disk. This makes it very
> difficult to implement fsync efficiently.

Exactly!

> However, logically they are two distinctive functionalities

Exactly!

Those two points are exactly why the concept of barriers must be forgotten for the 
sake of productivity and replaced by finer-grained abstractions, as well as why 
they were removed from the Linux kernel.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15 22:35                                   ` Chris Friesen
@ 2012-11-17  5:02                                     ` Vladislav Bolkhovitin
  2012-11-20  1:23                                       ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-17  5:02 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ryan Johnson, General Discussion of SQLite Database,
	Vladislav Bolkhovitin, Nico Williams, linux-fsdevel,
	Theodore Ts'o, linux-kernel, Richard Hipp


Chris Friesen, on 11/15/2012 05:35 PM wrote:
>> The easiest way to implement this fsync would involve three things:
>> 1. Schedule writes for all dirty pages in the fs cache that belong to
>> the affected file, wait for the device to report success, issue a cache
>> flush to the device (or request ordering commands, if available) to make
>> it tell the truth, and wait for the device to report success. AFAIK this
>> already happens, but without taking advantage of any request ordering
>> commands.
>> 2. The requesting thread returns as soon as the kernel has identified
>> all data that will be written back. This is new, but pretty similar to
>> what AIO already does.
>> 3. No write is allowed to enqueue any requests at the device that
>> involve the same file, until all outstanding fsync complete [3]. This is
>> new.
>
> This sounds interesting as a way to expose some useful semantics to userspace.
>
> I assume we'd need to come up with a new syscall or something since it doesn't
> match the behaviour of posix fsync().

This is how I would export cache sync and request ordering abstractions to 
user space:

For async IO (io_submit() and friends) I would extend struct iocb with flags 
that would allow setting the required capabilities, i.e. whether this request is 
FUA, or a full cache sync, immediate [1] or not, ORDERED or not, or all at the 
same time, per iocb.

For the regular read()/write() I would add one more flag to the "flags" 
parameter of sync_file_range(): whether this sync is immediate or not.

To enforce ordering rules I would add one more command to fcntl(). It would make 
the latest submitted write on this fd ORDERED.

Altogether, those should provide the requested functionality in a simple, 
effective, unambiguous and backward-compatible manner.
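
For illustration, an AIO submission with such flags might look roughly like 
this; IOCB_FLAG_ORDERED and IOCB_FLAG_FLUSH_IMMEDIATE are hypothetical and do 
not exist in current kernels or libaio:

/* Usage sketch only: the two flags below are hypothetical extensions;
 * everything else is ordinary libaio. */
#include <libaio.h>
#include <string.h>

#define IOCB_FLAG_ORDERED          (1u << 8)   /* hypothetical */
#define IOCB_FLAG_FLUSH_IMMEDIATE  (1u << 9)   /* hypothetical */

int submit_ordered_write(io_context_t ctx, int fd,
                         void *buf, size_t len, long long off)
{
    struct iocb cb;
    struct iocb *list[1] = { &cb };

    io_prep_pwrite(&cb, fd, buf, len, off);
    cb.u.c.flags |= IOCB_FLAG_ORDERED | IOCB_FLAG_FLUSH_IMMEDIATE;

    /* Returns the number of iocbs submitted (1) or a negative errno. */
    return io_submit(ctx, 1, list);
}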

Vlad

1. See my other today's e-mail about what is immediate cache sync.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-17  5:02                                     ` Vladislav Bolkhovitin
@ 2012-11-20  1:23                                       ` Vladislav Bolkhovitin
  2012-11-26 20:05                                         ` Nico Williams
  0 siblings, 1 reply; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-20  1:23 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ryan Johnson, General Discussion of SQLite Database,
	Nico Williams, linux-fsdevel, Theodore Ts'o, linux-kernel,
	Richard Hipp

Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:
>>> The easiest way to implement this fsync would involve three things:
>>> 1. Schedule writes for all dirty pages in the fs cache that belong to
>>> the affected file, wait for the device to report success, issue a cache
>>> flush to the device (or request ordering commands, if available) to make
>>> it tell the truth, and wait for the device to report success. AFAIK this
>>> already happens, but without taking advantage of any request ordering
>>> commands.
>>> 2. The requesting thread returns as soon as the kernel has identified
>>> all data that will be written back. This is new, but pretty similar to
>>> what AIO already does.
>>> 3. No write is allowed to enqueue any requests at the device that
>>> involve the same file, until all outstanding fsync complete [3]. This is
>>> new.
>>
>> This sounds interesting as a way to expose some useful semantics to userspace.
>>
>> I assume we'd need to come up with a new syscall or something since it doesn't
>> match the behaviour of posix fsync().
>
> This is how I would export cache sync and requests ordering abstractions to the
> user space:
>
> For async IO (io_submit() and friends) I would extend struct iocb by flags, which
> would allow to set the required capabilities, i.e. if this request is FUA, or full
> cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per
> each iocb.
>
> For the regular read()/write() I would add to "flags" parameter of
> sync_file_range() one more flag: if this sync is immediate or not.
>
> To enforce ordering rules I would add one more command to fcntl(). It would make
> the latest submitted write in this fd ORDERED.

Correction: to avoid possible races, it would be better if the new fcntl() command 
specified that the N subsequent read()/write()/sync() calls are ORDERED.

For instance, in the simplest case of N=1, the single write() following the 
fcntl() would be handled as ORDERED.

(Unfortunately, it doesn't look like this old read()/write() interface has space 
for a more elegant solution)
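
For illustration, the N=1 case might look roughly like this from an 
application; F_SET_ORDERED is a hypothetical fcntl() command, not part of any 
existing kernel:

/* Usage sketch only: F_SET_ORDERED and its command number are made up. */
#include <fcntl.h>
#include <unistd.h>

#define F_SET_ORDERED  1050   /* hypothetical command number */

int ordered_commit(int fd, const void *rec, size_t len, off_t off)
{
    /* Mark the next 1 submitted write on this fd as ORDERED ... */
    if (fcntl(fd, F_SET_ORDERED, 1) < 0)
        return -1;

    /* ... so this write may not be reordered ahead of earlier ones, but
     * the call itself can still return before the data is durable. */
    if (pwrite(fd, rec, len, off) < 0)
        return -1;
    return 0;
}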

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-20  1:23                                       ` Vladislav Bolkhovitin
@ 2012-11-26 20:05                                         ` Nico Williams
  2012-11-29  2:15                                           ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 58+ messages in thread
From: Nico Williams @ 2012-11-26 20:05 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Chris Friesen, Ryan Johnson,
	General Discussion of SQLite Database, linux-fsdevel,
	Theodore Ts'o, linux-kernel, Richard Hipp

Vlad,

You keep saying that programmers don't understand "barriers".  You've
provided no evidence of this.  Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.

For some filesystems it is possible to configure fsync() to act as a
barrier: for example, ZFS can be told to perform no synchronous
operations for a given dataset, in which case fsync() devolves into a
simple barrier.  (Cue Simon to tell us that some hardware and some
OSes, and some filesystems simply cannot implement fsync(), with or
without synchronicity.)

So just give us a barrier.  Yes, I know, it's tricky to implement, but
it'd be OK to return EOPNOTSUPP, and let the app do something else
(e.g., call fsync() instead, tell the user to expect instability, tell
the user to get a better system, ...).

As for implementation, it helps to have a journalled or log-structured
filesystem.  It also helps to have hardware synchronization primitives
that don't suck, but these aren't entirely necessary: ZFS, for
example, can recover [*] from N incomplete transactions[**], and still
provides fsync() as a barrier given its on-disk structure and the ZIL.
 Note that ZFS recovery from incomplete transactions should never be
necessary where the HW has proper cache flush support, but the
recovery functionality was added precisely because of lousy hardware.

[*]   At volume import time, such as at boot-time.
[**] Granted, this requires user input, but if the user didn't care it
could be made automatic.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-26 20:05                                         ` Nico Williams
@ 2012-11-29  2:15                                           ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-29  2:15 UTC (permalink / raw)
  To: Nico Williams
  Cc: Chris Friesen, Ryan Johnson,
	General Discussion of SQLite Database, linux-fsdevel,
	Theodore Ts'o, linux-kernel, Richard Hipp


Nico Williams, on 11/26/2012 03:05 PM wrote:
> Vlad,
>
> You keep saying that programmers don't understand "barriers".  You've
> provided no evidence of this. Meanwhile memory barriers are generally
> well understood, and every programmer I know understands that a
> "barrier" is a synchronization primitive that says that all operations
> of a certain type will have completed prior to the barrier returning
> control to its caller.

Well, your understanding of memory barriers is wrong, and you are illustrating 
that the memory barriers concept is not so well understood in practice.

Simplifying, memory barrier instructions are not a "cache flush" of the local CPU, 
as is often thought. They set the order in which reads or writes from other CPUs 
become visible on this CPU. And nothing else. Locally, on each CPU, reads and 
writes are always seen in order. So, (1) on a single-CPU system memory barrier 
instructions don't make any sense, and (2) they should come at least in a pair, 
one on each CPU participating in the interaction; otherwise it's an apparent sign 
of a mistake.
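
For reference, the usual paired pattern looks like this (shown with C11 
fences; in the kernel these would be smp_wmb()/smp_rmb()):

/* The producer's write barrier is only meaningful together with the
 * consumer's read barrier on the other CPU. */
#include <stdatomic.h>

static int data;
static atomic_int ready;

void producer(void)                 /* runs on CPU 0 */
{
    data = 42;                                      /* plain store     */
    atomic_thread_fence(memory_order_release);      /* "write barrier" */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consumer(void)                  /* runs on CPU 1 */
{
    if (atomic_load_explicit(&ready, memory_order_relaxed)) {
        atomic_thread_fence(memory_order_acquire);  /* paired "read barrier" */
        return data;                /* guaranteed to see 42 */
    }
    return -1;                      /* not ready yet */
}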

There's nothing similar in storage, because storage has strong consistency 
requirements even if it is distributed. All those clouds and hadoops with weak 
consistency requirements are outside of this discussion, although even they don't 
have anything similar to memory barriers.

As I already wrote, the concept of a flat Earth with the Sun revolving around it 
is also very simple to understand. Are you still using this concept?

> So just give us a barrier.

As with the flat Earth, I'd strongly suggest you start using an adequate concept 
of what you want to achieve, starting from what I proposed a few e-mails ago in 
this thread.

If you look at it, it offers exactly what you want, only named correctly.

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2012-11-29  2:15 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CALwJ=MzHjAOs4J4kGH6HLdwP8E88StDWyAPVumNg9zCWpS9Tdg@mail.gmail.com>
2012-10-10 17:17 ` light weight write barriers Andi Kleen
2012-10-11 16:32   ` [sqlite] " 杨苏立 Yang Su Li
2012-10-11 17:41     ` Christoph Hellwig
2012-10-23 19:53     ` Vladislav Bolkhovitin
2012-10-24 21:17       ` Nico Williams
2012-10-24 22:03         ` david
2012-10-25  0:20           ` Nico Williams
2012-10-25  1:04             ` david
2012-10-25  5:18               ` Nico Williams
2012-10-25  6:02                 ` Theodore Ts'o
2012-10-25  6:58                   ` david
2012-10-25 14:03                     ` Theodore Ts'o
2012-10-25 18:03                       ` david
2012-10-25 18:29                         ` Theodore Ts'o
2012-11-05 20:03                           ` Pavel Machek
2012-11-05 22:04                             ` Theodore Ts'o
     [not found]                               ` <CALwJ=Mx-uEFLXK2wywekk=0dwrwVFb68wocnH9bjXJmHRsJx3w@mail.gmail.com>
2012-11-05 23:00                                 ` Theodore Ts'o
2012-10-30 23:49                   ` Nico Williams
2012-10-25  5:42           ` Theodore Ts'o
2012-10-25  7:11             ` david
2012-10-27  1:52         ` Vladislav Bolkhovitin
2012-10-25  5:14       ` Theodore Ts'o
2012-10-25 13:03         ` Alan Cox
2012-10-25 13:50           ` Theodore Ts'o
2012-10-27  1:55             ` Vladislav Bolkhovitin
2012-10-27  1:54         ` Vladislav Bolkhovitin
2012-10-27  4:44           ` Theodore Ts'o
2012-10-30 22:22             ` Vladislav Bolkhovitin
2012-10-31  9:54               ` Alan Cox
2012-11-01 20:18                 ` Vladislav Bolkhovitin
2012-11-01 21:24                   ` Alan Cox
2012-11-02  0:15                     ` Vladislav Bolkhovitin
2012-11-02  0:38                     ` Howard Chu
2012-11-02 12:33                       ` Alan Cox
2012-11-13  3:41                         ` Vladislav Bolkhovitin
2012-11-13 17:40                           ` Alan Cox
2012-11-13 19:13                             ` Nico Williams
2012-11-15  1:17                               ` Vladislav Bolkhovitin
2012-11-15 12:07                                 ` David Lang
2012-11-16 15:06                                   ` Howard Chu
2012-11-16 15:31                                     ` Ric Wheeler
2012-11-16 15:54                                       ` Howard Chu
2012-11-16 18:03                                         ` Ric Wheeler
2012-11-16 19:14                                     ` David Lang
2012-11-17  5:02                                   ` Vladislav Bolkhovitin
     [not found]                                   ` <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com>
2012-11-17  5:02                                     ` Vladislav Bolkhovitin
2012-11-15 17:06                                 ` Ryan Johnson
2012-11-15 22:35                                   ` Chris Friesen
2012-11-17  5:02                                     ` Vladislav Bolkhovitin
2012-11-20  1:23                                       ` Vladislav Bolkhovitin
2012-11-26 20:05                                         ` Nico Williams
2012-11-29  2:15                                           ` Vladislav Bolkhovitin
2012-11-15  1:16                             ` Vladislav Bolkhovitin
2012-11-13  3:37                       ` Vladislav Bolkhovitin
     [not found]                       ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com>
2012-11-13  3:41                         ` Vladislav Bolkhovitin
     [not found]           ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com>
2012-11-13  3:42             ` Vladislav Bolkhovitin
     [not found]   ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com>
2012-10-11 16:38     ` Nico Williams
2012-10-11 16:48       ` Nico Williams
