* light weight write barriers
[not found] <CALwJ=MzHjAOs4J4kGH6HLdwP8E88StDWyAPVumNg9zCWpS9Tdg@mail.gmail.com>
@ 2012-10-10 17:17 ` Andi Kleen
  2012-10-11 16:32 ` [sqlite] " 杨苏立 Yang Su Li
[not found] ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com>
  0 siblings, 2 replies; 58+ messages in thread
From: Andi Kleen @ 2012-10-10 17:17 UTC (permalink / raw)
To: linux-kernel, sqlite-users, linux-fsdevel, drh

Richard Hipp writes:
>
> We would really, really love to have some kind of write-barrier that is
> lighter than fsync(). If there is some method other than fsync() for
> forcing a write-barrier on Linux that we don't know about, please
> enlighten us.

Could you list the requirements of such a light weight barrier?
i.e. what would it need to do minimally, what's different from
fsync/fdatasync?

-Andi

--
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-10 17:17 ` light weight write barriers Andi Kleen
@ 2012-10-11 16:32 ` 杨苏立 Yang Su Li
  2012-10-11 17:41 ` Christoph Hellwig
  2012-10-23 19:53 ` Vladislav Bolkhovitin
[not found] ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com>
  1 sibling, 2 replies; 58+ messages in thread
From: 杨苏立 Yang Su Li @ 2012-10-11 16:32 UTC (permalink / raw)
To: General Discussion of SQLite Database; +Cc: linux-kernel, linux-fsdevel, drh

I am not quite sure whether I should ask this question here, but in terms of a lightweight barrier/fsync, could anyone tell me why the device driver / OS provides the barrier interface rather than some other abstraction? I am sorry if this sounds like a stupid question or if it has been discussed before....

I mean, most of the time we only need some ordering in writes; not a complete order, but a partial, very simple topological order. And a barrier seems to be a heavyweight solution to achieve this: you have to finish all writes before the barrier, then start all writes issued after the barrier. That is an ordering much stronger than what we need, isn't it?

As most of the time the order we need does not involve too many blocks (certainly a lot fewer than all the cached blocks in the system or in the disk's cache), that topological order isn't likely to be very complicated, and I imagine it could be implemented efficiently in a modern device, which already has complicated caching/garbage collection/whatever going on internally. In particular, it seems not too hard to implement on top of SCSI's ordered/simple task attributes? (I believe Windows does this to an extent, but I am not quite sure.)

Thanks a lot

Suli

On Wed, Oct 10, 2012 at 12:17 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Richard Hipp writes:
>>
>> We would really, really love to have some kind of write-barrier that is
>> lighter than fsync(). If there is some method other than fsync() for
>> forcing a write-barrier on Linux that we don't know about, please
>> enlighten us.
>
> Could you list the requirements of such a light weight barrier?
> i.e. what would it need to do minimally, what's different from
> fsync/fdatasync?
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-11 16:32 ` [sqlite] " 杨苏立 Yang Su Li
@ 2012-10-11 17:41 ` Christoph Hellwig
  2012-10-23 19:53 ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2012-10-11 17:41 UTC (permalink / raw)
To: 杨苏立 Yang Su Li
Cc: General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

On Thu, Oct 11, 2012 at 11:32:27AM -0500, 杨苏立 Yang Su Li wrote:
> I am not quite sure whether I should ask this question here, but in
> terms of a lightweight barrier/fsync, could anyone tell me why the
> device driver / OS provides the barrier interface rather than some
> other abstraction? I am sorry if this sounds like a stupid question
> or if it has been discussed before....

It does not. Except for the legacy mount option naming, there is no such thing as a barrier in Linux these days.

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-11 16:32 ` [sqlite] " 杨苏立 Yang Su Li
  2012-10-11 17:41 ` Christoph Hellwig
@ 2012-10-23 19:53 ` Vladislav Bolkhovitin
  2012-10-24 21:17 ` Nico Williams
  2012-10-25 5:14 ` Theodore Ts'o
  1 sibling, 2 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-23 19:53 UTC (permalink / raw)
To: 杨苏立 Yang Su Li
Cc: General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:
> I am not quite sure whether I should ask this question here, but in
> terms of a lightweight barrier/fsync, could anyone tell me why the
> device driver / OS provides the barrier interface rather than some
> other abstraction? I am sorry if this sounds like a stupid question
> or if it has been discussed before....
>
> I mean, most of the time we only need some ordering in writes; not a
> complete order, but a partial, very simple topological order. And a
> barrier seems to be a heavyweight solution to achieve this: you have
> to finish all writes before the barrier, then start all writes issued
> after the barrier. That is an ordering much stronger than what we
> need, isn't it?
>
> As most of the time the order we need does not involve too many blocks
> (certainly a lot fewer than all the cached blocks in the system or in
> the disk's cache), that topological order isn't likely to be very
> complicated, and I imagine it could be implemented efficiently in a
> modern device, which already has complicated caching/garbage
> collection/whatever going on internally. In particular, it seems not
> too hard to implement on top of SCSI's ordered/simple task attributes?

Yes, SCSI has full support for ordered/simple commands designed exactly for that task: to have a steady flow of commands even when some of them are ordered. It also has the necessary facilities to handle command errors without unexpected reordering of subsequent commands (ACA, etc.). Those allow the full storage performance to be reached by fully "filling the pipe", to use a networking term. I can easily imagine real-life configs where that brings 2+ times more performance than queue flushing does. In fact, AFAIK, AIX requires storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports naturally have the link as the point of serialization, so all you need in a multithreaded environment is to pass a sequence number (SN) from the point where each ORDERED command is created to the point where it is sent to the link, and make sure that no SIMPLE commands can ever cross ORDERED commands. You can see how it is implemented in SCST in an elegant and lockless manner (for SIMPLE commands).

But historically, for some reason, Linux storage developers were stuck with the "barriers" concept, which is obviously not the same as ORDERED commands, and hence had a lot of trouble with its ambiguous semantics. As far as I can tell, the reason was a lack of sufficiently deep SCSI understanding (how to handle errors, the belief that ACA is legacy from parallel-SCSI times, etc.).

Hopefully, eventually the storage developers will realize the value behind ordered commands and learn the corresponding SCSI facilities to deal with them. It's quite easy to demonstrate this value if you know where to look and don't blindly refuse the possibility. I have already tried to explain it a couple of times, but was not successful.
Before that happens, people will keep returning again and again with the same simple questions: why must the queue be flushed for any ordered operation? Isn't it obvious overkill?

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-23 19:53 ` Vladislav Bolkhovitin
@ 2012-10-24 21:17 ` Nico Williams
  2012-10-24 22:03 ` david
  2012-10-27 1:52 ` Vladislav Bolkhovitin
  2012-10-25 5:14 ` Theodore Ts'o
  1 sibling, 2 replies; 58+ messages in thread
From: Nico Williams @ 2012-10-24 21:17 UTC (permalink / raw)
To: General Discussion of SQLite Database
Cc: 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Tue, Oct 23, 2012 at 2:53 PM, Vladislav Bolkhovitin <vvvvvst@gmail.com> wrote:
>> As most of the time the order we need does not involve too many blocks
>> (certainly a lot fewer than all the cached blocks in the system or in
>> the disk's cache), that topological order isn't likely to be very
>> complicated, and I imagine it could be implemented efficiently in a
>> modern device, which already has complicated caching/garbage
>> collection/whatever going on internally. In particular, it seems not
>> too hard to implement on top of SCSI's ordered/simple task attributes?

If you have multiple layers involved (e.g., SQLite then the filesystem, and if the filesystem is spread over multiple storage devices), and if transactions are not bounded, and on top of that if there are other concurrent writers to the same filesystem (even if not to the same files), then the set of blocks to write and their internal ordering can get complex. In practice filesystems try to break these up into large self-consistent chunks and write those -- ZFS does this, for example -- and this is aided by the lack of transactional semantics in the filesystem.

For SQLite with a VFS that talks [i]SCSI directly, things could be much more manageable, as there's only one write transaction in progress at any given time. But that's not realistic, except, perhaps, in some embedded systems.

> Yes, SCSI has full support for ordered/simple commands designed exactly
> for that task: [...]
>
> [...]
>
> But historically, for some reason, Linux storage developers were stuck
> with the "barriers" concept, which is obviously not the same as ORDERED
> commands, and hence had a lot of trouble with its ambiguous semantics.
> As far as I can tell, the reason was a lack of sufficiently deep SCSI
> understanding (how to handle errors, the belief that ACA is legacy from
> parallel-SCSI times, etc.).

Barriers are a very simple abstraction, so there's that.

> Hopefully, eventually the storage developers will realize the value
> behind ordered commands and learn the corresponding SCSI facilities to
> deal with them. It's quite easy to demonstrate this value if you know
> where to look and don't blindly refuse the possibility. I have already
> tried to explain it a couple of times, but was not successful.

Exposing the ordering of lower-layer operations to filesystem applications is a non-starter. About the only reasonable thing to do with a filesystem is add barrier operations. I know, you're talking about lower-layer capabilities, and SQLite could talk to that layer directly, but let's face it: it's not likely to.

> Before that happens, people will keep returning again and again with
> the same simple questions: why must the queue be flushed for any
> ordered operation? Isn't it obvious overkill?

That [cache flushing] is not what's being asked for here. Just a light-weight barrier.
My proposal works without having to add new system calls: a) use a COW format, b) have background threads doing fsync()s, c) in each transaction's root block note the last known-committed (from a completed fsync()) transaction's root block, d) have an array of well-known uberblocks large enough to accommodate as many transactions as possible without having to wait for any one fsync() to complete, and e) do not reclaim space from any one past transaction until at least one subsequent transaction is fully committed. This obtains ACI- transaction semantics (survives power failures, but without durability for the last N transactions at power-failure time) without requiring changes to the OS at all, and with support for delayed D (durability) notification.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread
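To make the proposal concrete, here is a minimal C sketch of steps (b), (c), and (e), assuming a single database fd and a monotonically increasing transaction counter. Every name in it is illustrative, not taken from SQLite or any real library; it is a sketch of the idea, not a definitive implementation.

    /* Sketch of the background-fsync barrier.  Assumes a COW format:
     * each transaction appends a new root block; nothing is overwritten.
     * Build with: cc -pthread sketch.c */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
    static uint64_t last_written;  /* newest txn handed to write()      */
    static uint64_t last_durable;  /* newest txn proven stable on media */
    static int db_fd;

    static void *fsync_worker(void *unused)
    {
        for (;;) {
            pthread_mutex_lock(&lk);
            uint64_t upto = last_written;
            pthread_mutex_unlock(&lk);

            /* fsync() is the barrier: when it returns, every write
             * issued before it -- i.e. all txns <= upto -- is durable. */
            if (fsync(db_fd) == 0) {
                pthread_mutex_lock(&lk);
                if (upto > last_durable)
                    last_durable = upto;
                pthread_mutex_unlock(&lk);
            }
            usleep(10000);  /* pace the barriers */
        }
        return NULL;
    }

    int main(void)
    {
        db_fd = open("db.img", O_RDWR | O_CREAT, 0644);
        pthread_t t;
        pthread_create(&t, NULL, fsync_worker, NULL);
        /* ... writer loop appends txns, bumping last_written; each new
         * root block records last_durable (step c), and txns newer than
         * last_durable are never garbage-collected (step e) ... */
        pause();
    }

Because space above last_durable is never reclaimed, recovery can always walk back from the newest readable root to one whose noted predecessor is fully committed.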
* Re: [sqlite] light weight write barriers
  2012-10-24 21:17 ` Nico Williams
@ 2012-10-24 22:03 ` david
  2012-10-25 0:20 ` Nico Williams
  2012-10-25 5:42 ` Theodore Ts'o
  1 sibling, 2 replies; 58+ messages in thread
From: david @ 2012-10-24 22:03 UTC (permalink / raw)
To: Nico Williams
Cc: General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Wed, 24 Oct 2012, Nico Williams wrote:
>> Before that happens, people will keep returning again and again with
>> the same simple questions: why must the queue be flushed for any
>> ordered operation? Isn't it obvious overkill?
>
> That [cache flushing] is not what's being asked for here. Just a
> light-weight barrier. My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known uberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, and e) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed. This obtains ACI- transaction semantics (survives power
> failures, but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.

I'm doing some work with rsyslog and its disk-based queues, and there is a similar issue there. The good news is that we can have a version that is Linux-specific (rsyslog is used on other OSs, but there is an existing queue implementation that they can use; if the faster one is Linux-only but is significantly faster, that's just a win for Linux).

Like what is being described for sqlite, losing the tail end of the messages is not a big problem under normal conditions. But there is a need to be sure that what is there is complete up to the point where it's lost.

This is similar in concept to the write-ahead logs done for databases (without the absolute durability requirement).

1. new messages arrive and get added to the end of the queue file.

2. a thread updates the queue to indicate that it is in the process of delivering a block of messages

3. the thread updates the queue to indicate that the block of messages has been delivered

4. garbage collection happens to delete the old messages to free up space (if queues go into files, this can just be to limit the file size, spilling to multiple files, and when an old file is completely marked as delivered, delete it)

I am not fully understanding how what you are describing (COW, separate fsync threads, etc.) would be implemented on top of existing filesystems. Most of what you are describing seems to require access to the underlying storage to implement.

Could you give a more detailed explanation?

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-24 22:03 ` david
@ 2012-10-25 0:20 ` Nico Williams
  2012-10-25 1:04 ` david
  0 siblings, 1 reply; 58+ messages in thread
From: Nico Williams @ 2012-10-25 0:20 UTC (permalink / raw)
To: david
Cc: General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Wed, Oct 24, 2012 at 5:03 PM, <david@lang.hm> wrote:
> I'm doing some work with rsyslog and its disk-based queues, and there
> is a similar issue there. The good news is that we can have a version
> that is Linux-specific (rsyslog is used on other OSs, but there is an
> existing queue implementation that they can use; if the faster one is
> Linux-only but is significantly faster, that's just a win for Linux).
>
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is a
> need to be sure that what is there is complete up to the point where
> it's lost.
>
> This is similar in concept to the write-ahead logs done for databases
> (without the absolute durability requirement).
>
> [...]
>
> I am not fully understanding how what you are describing (COW, separate
> fsync threads, etc.) would be implemented on top of existing
> filesystems. Most of what you are describing seems to require access
> to the underlying storage to implement.
>
> Could you give a more detailed explanation?

COW is "copy on write", which is actually a bit of a misnomer -- all COW means is that blocks aren't over-written; instead, new blocks are written. In particular this means that inodes, indirect blocks, data blocks, and so on that are changed are actually written to new locations, and the on-disk format needs to handle this indirection.

As for fsync() and background threads: fsync() is synchronous, but in this scheme we want it to happen asynchronously, and then we want to update each transaction with a pointer to the last transaction that is known stable given an fsync()'s return.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread
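As a toy illustration of that indirection, assuming a single file whose fixed-size root record lives at offset 0 (all names invented for the example): the new data is appended first, and the root pointer flips to it only afterwards, so a crash at any point leaves the old root pointing at intact old data.

    /* Toy copy-on-write update: never overwrite; append the new block,
     * then publish it by rewriting the small root record at offset 0.
     * A real format would also checksum the root. */
    #include <stdint.h>
    #include <unistd.h>

    struct root {
        uint64_t data_off;   /* where the current data block lives */
        uint64_t data_len;
    };

    int cow_update(int fd, const void *buf, size_t len)
    {
        off_t off = lseek(fd, 0, SEEK_END);    /* new blocks go at EOF */
        if (off < 0 || pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;

        /* Order data before root.  Here a full fsync(); in the scheme
         * above this is where the deferred background barrier goes. */
        if (fsync(fd) != 0)
            return -1;

        struct root r = { (uint64_t)off, (uint64_t)len };
        return pwrite(fd, &r, sizeof r, 0) == (ssize_t)sizeof r ? 0 : -1;
    }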
* Re: [sqlite] light weight write barriers
  2012-10-25 0:20 ` Nico Williams
@ 2012-10-25 1:04 ` david
  2012-10-25 5:18 ` Nico Williams
  0 siblings, 1 reply; 58+ messages in thread
From: david @ 2012-10-25 1:04 UTC (permalink / raw)
To: Nico Williams
Cc: General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Wed, 24 Oct 2012, Nico Williams wrote:
> On Wed, Oct 24, 2012 at 5:03 PM, <david@lang.hm> wrote:
>> I'm doing some work with rsyslog and its disk-based queues, and there
>> is a similar issue there. The good news is that we can have a version
>> that is Linux-specific (rsyslog is used on other OSs, but there is an
>> existing queue implementation that they can use; if the faster one is
>> Linux-only but is significantly faster, that's just a win for Linux).
>>
>> Like what is being described for sqlite, losing the tail end of the
>> messages is not a big problem under normal conditions. But there is a
>> need to be sure that what is there is complete up to the point where
>> it's lost.
>>
>> This is similar in concept to the write-ahead logs done for databases
>> (without the absolute durability requirement).
>>
>> [...]
>>
>> I am not fully understanding how what you are describing (COW,
>> separate fsync threads, etc.) would be implemented on top of existing
>> filesystems. Most of what you are describing seems to require access
>> to the underlying storage to implement.
>>
>> Could you give a more detailed explanation?
>
> COW is "copy on write", which is actually a bit of a misnomer -- all
> COW means is that blocks aren't over-written; instead, new blocks are
> written. In particular this means that inodes, indirect blocks, data
> blocks, and so on that are changed are actually written to new
> locations, and the on-disk format needs to handle this indirection.

so how can you do this, and keep the writes in order (especially between two files), without being the filesystem?

> As for fsync() and background threads: fsync() is synchronous, but in
> this scheme we want it to happen asynchronously, and then we want to
> update each transaction with a pointer to the last transaction that is
> known stable given an fsync()'s return.

If you could specify ordering between two writes, I could see a process along the lines of:

Append the new message to file1; append tiny status updates to file2; every million messages, move to new files. Once the last message has been processed for the old set of files, delete them.

Since file2 is small, you can reconstruct state fairly cheaply.

But unless you are a filesystem, how can you make sure that the message data is written to file1 before you write the metadata about the message to file2? Right now it seems that there is no way for an application to do this other than doing an fsync(file1) before writing the metadata to file2.

And there is no way for the application to tell the filesystem to write the data in file2 in order (to make sure that block 3 is not written and then have the system crash before block 2 is written), so the application needs to do frequent fsync(file2) calls.
If you need complete durability of your data, there are well documented ways of enforcing it (including the lwn.net article http://lwn.net/Articles/457667/ ).

But if you don't need the guarantee that your data is on disk now, and just need it ordered so that if you crash you are guaranteed to lose only data off the tail of your file, there doesn't seem to be any way to do this other than using the fsync() hammer and waiting out the overhead of forcing the data to disk now.

Or, as I type this, it occurs to me that you may be saying that every time you want an ordering guarantee, you spawn a new thread to do the fsync and then just keep processing. The fsync will happen at some point, and the writes will not be re-ordered across the fsync, but you can keep going, writing more data while the fsyncs are pending.

Then if you have a filesystem and I/O subsystem that can consolidate the fsyncs from all the different threads together into one I/O operation, without having to flush the entire I/O queue for each one, you can get acceptable performance, with ordering. If the system crashes, data that hasn't had its fsync() complete will be the only thing that is lost.

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread
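For contrast, the "fsync() hammer" described above looks like this in code; today this is the only portable way to guarantee that the file1 data reaches disk before the file2 metadata that references it. The function name and record format are invented for the example.

    /* Portable but heavy "data before metadata" ordering across two
     * files: a full fsync() between the dependent writes. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int log_message(int data_fd, int meta_fd, const char *msg, uint64_t seq)
    {
        size_t n = strlen(msg);
        if (write(data_fd, msg, n) != (ssize_t)n)
            return -1;

        if (fsync(data_fd) != 0)   /* the expensive ordering point */
            return -1;

        /* Only now is it safe to record that message <seq> exists. */
        char rec[32];
        int m = snprintf(rec, sizeof rec, "have %llu\n",
                         (unsigned long long)seq);
        return write(meta_fd, rec, (size_t)m) == (ssize_t)m ? 0 : -1;
    }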
* Re: [sqlite] light weight write barriers
  2012-10-25 1:04 ` david
@ 2012-10-25 5:18 ` Nico Williams
  2012-10-25 6:02 ` Theodore Ts'o
  0 siblings, 1 reply; 58+ messages in thread
From: Nico Williams @ 2012-10-25 5:18 UTC (permalink / raw)
To: david
Cc: General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Wed, Oct 24, 2012 at 8:04 PM, <david@lang.hm> wrote:
> On Wed, 24 Oct 2012, Nico Williams wrote:
>> COW is "copy on write", which is actually a bit of a misnomer -- all
>> COW means is that blocks aren't over-written; instead, new blocks are
>> written. In particular this means that inodes, indirect blocks, data
>> blocks, and so on that are changed are actually written to new
>> locations, and the on-disk format needs to handle this indirection.
>
> so how can you do this, and keep the writes in order (especially
> between two files), without being the filesystem?

By trusting fsync(). And if you don't care about immediate Durability you can run the fsync() in a background thread and mark the associated transaction as completed in the next transaction to be written after the fsync() completes.

>> As for fsync() and background threads: fsync() is synchronous, but in
>> this scheme we want it to happen asynchronously, and then we want to
>> update each transaction with a pointer to the last transaction that
>> is known stable given an fsync()'s return.
>
> If you could specify ordering between two writes, I could see a
> process along the lines of
>
> [...]

fsync() deals with just one file. fsync()s of different files are another story. That said, as long as the format of the two files is COW, you can still compose transactions involving the two files. The key is that the file contents themselves must be COW-structured.

Incidentally, here's a single-file bag of b-trees that uses a COW format: MDB, which can be found in git://git.openldap.org/openldap.git, in the mdb.master branch.

> Or, as I type this, it occurs to me that you may be saying that every
> time you want an ordering guarantee, you spawn a new thread to do the
> fsync and then just keep processing. The fsync will happen at some
> point, and the writes will not be re-ordered across the fsync, but you
> can keep going, writing more data while the fsyncs are pending.

Yes, but only if the file's format is COWish. The point is that COW saves the day. A file-based DB needs to be COW. And the filesystem needs to be as well.

Note that write-ahead logging approximates COW well enough most of the time.

> Then if you have a filesystem and I/O subsystem that can consolidate
> the fsyncs from all the different threads together into one I/O
> operation, without having to flush the entire I/O queue for each one,
> you can get acceptable performance, with ordering. If the system
> crashes, data that hasn't had its fsync() complete will be the only
> thing that is lost.

With the above caveat, yes.

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-25 5:18 ` Nico Williams
@ 2012-10-25 6:02 ` Theodore Ts'o
  2012-10-25 6:58 ` david
  2012-10-30 23:49 ` Nico Williams
  0 siblings, 2 replies; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 6:02 UTC (permalink / raw)
To: Nico Williams
Cc: david, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>
> By trusting fsync(). And if you don't care about immediate Durability
> you can run the fsync() in a background thread and mark the associated
> transaction as completed in the next transaction to be written after
> the fsync() completes.

The challenge is when you have entangled metadata updates. That is, you update file A and file B, and file A and B might share metadata. In order to sync file A, you also have to update part of the metadata for the updates to file B, which means calculating the dependencies of what you have to drag in can get very complicated. You can keep track of which bits of the metadata you have to undo and then redo before writing out the metadata for fsync(A), but that basically means you have to implement soft updates, with all of the complexity this implies: http://lwn.net/Articles/339337/

If you can keep all of the metadata separate, this can be somewhat mitigated, but usually the block allocation records (regardless of whether you use a tree, or a bitmap, or some other data structure) tend to have entanglement problems.

It certainly is not impossible; RDBMSes have implemented this. On the other hand, they generally aren't as fast as file systems for non-transactional workloads, and people really care about performance on those sorts of workloads for file systems. (About a decade ago, Oracle tried to claim that you could run file system workloads using an Oracle database as a back-end. Everyone laughed at them, and the idea died a quick, merciful death.)

Still, if you want to try to implement such a thing, by all means, give it a try. But I think you'll find that creating a file system that can compete with existing file systems for performance, and *then* also supports a transactional model, is going to be quite a challenge.

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-25 6:02 ` Theodore Ts'o
@ 2012-10-25 6:58 ` david
  2012-10-25 14:03 ` Theodore Ts'o
  2012-10-30 23:49 ` Nico Williams
  1 sibling, 1 reply; 58+ messages in thread
From: david @ 2012-10-25 6:58 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Nico Williams, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>>
>> By trusting fsync(). And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.
>
> The challenge is when you have entangled metadata updates. That is,
> you update file A and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated. You can keep track
> of which bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, with all of the complexity this
> implies: http://lwn.net/Articles/339337/
>
> If you can keep all of the metadata separate, this can be somewhat
> mitigated, but usually the block allocation records (regardless of
> whether you use a tree, or a bitmap, or some other data structure)
> tend to have entanglement problems.

hmm, two thoughts occur to me.

1. to avoid entanglement, put the two files in separate directories

2. take advantage of entanglement to enforce ordering

   thread 1 (repeated): write a new message to file 1, spawn a new thread to fsync

   thread 2: write to file 2 that messages 1-5 are being worked on

   thread 2 (later): write to file 2 that messages 1-5 are done

When thread 1 spawns the new thread to do the fsync, the system will be forced to write the data to file 2 as of the time it does the fsync. This should make it so that you never have data written to file2 that refers to data that hasn't been written to file1 yet.

> It certainly is not impossible; RDBMSes have implemented this. On the
> other hand, they generally aren't as fast as file systems for
> non-transactional workloads, and people really care about performance
> on those sorts of workloads for file systems.

the RDBMSes have implemented stronger guarantees than what we need.

A few years ago I was investigating this for logging. With the reliable (RDBMS-style) but inefficient disk queue that rsyslog has, writing to a high-end Fusion-io SSD, ext2 resulted in ~8K logs/sec, ext3 resulted in ~2K logs/sec, and JFS/XFS resulted in ~4K logs/sec (ext4 wasn't considered stable enough at the time to be tested).

> Still, if you want to try to implement such a thing, by all means,
> give it a try. But I think you'll find that creating a file system
> that can compete with existing file systems for performance, and
> *then* also supports a transactional model, is going to be quite a
> challenge.

The question is trying to figure out a way to get ordering right with existing filesystems (preferably without using something too tied to a single filesystem implementation), not to try and create a new one.
The frustrating thing is that when people point out how things like sqlite are so horribly slow, the reply seems to be "well, that's what you get for doing so many fsyncs, don't do that", while when there is a 'problem' like the KDE "config loss" problem a few years ago, the response is "well, that's what you get for not doing fsync".

Both responses are correct, from a purely technical point of view.

But what's missing is any way to get the result of ordered I/O that will let you do something pretty fast, but with the guarantee that, if you lose data in a crash, the only loss you are risking is that your most recent data may be missing (either for one file, or using multiple files if that's what it takes).

Since this topic came up again, I figured I'd poke a bit and try to either get educated on how to do this "right" or see if there's something that could be added to the kernel to make it possible for userspace programs to do this.

What I think userspace really needs is something like a barrier function call: "for this fd, don't re-order writes as they go down through the stack".

If the hardware is going to reorder things once they hit the hardware, this is going to hurt performance (how much depends on a lot of stuff), but the filesystems are able to make their journals work, so there should be some way to let userspace do some sort of similar ordering.

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-25 6:58 ` david
@ 2012-10-25 14:03 ` Theodore Ts'o
  2012-10-25 18:03 ` david
  0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 14:03 UTC (permalink / raw)
To: david
Cc: Nico Williams, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Wed, Oct 24, 2012 at 11:58:49PM -0700, david@lang.hm wrote:
> The frustrating thing is that when people point out how things like
> sqlite are so horribly slow, the reply seems to be "well, that's what
> you get for doing so many fsyncs, don't do that", while when there is
> a 'problem' like the KDE "config loss" problem a few years ago, the
> response is "well, that's what you get for not doing fsync".

Sure... but the answer is to only do the fsyncs when you need to. For example, if GNOME and KDE are rewriting the entire registry file each time the application changes a single registry key, then sure, if you rewrite the entire registry file and fsync after each rewrite before you replace the file, you will be safe. And if the application needs to update dozens or hundreds of registry keys (or do so every time the window gets moved or resized), then yes, it will be slow.

But the application didn't have to do that! It could have updated all the registry keys in memory, and then updated the registry file periodically instead.

Similarly, Firefox didn't need to do a sqlite commit after every single time its history file was written, causing a third of a megabyte of write traffic each time you clicked on a web page. It could have batched its updates to the history file, since most of the time you don't care about making sure the web history is written to stable store before you're allowed to click on a link and visit the next web page.

Or does rsyslog *really* need to issue an fsync after each log message? Or could it batch updates so that every N seconds it flushes writes to the disk?

(And this is a problem with most Android applications as well. Apparently the framework APIs are such that it's easier for an application to treat each sqlite statement as an atomic update, so many/most application writers don't use explicit transaction boundaries, so updates don't get batched even though it would be more efficient if they did so.)

Sometimes the answer is not to try to create exotic database-like functionality in the file system --- the answer is to be more intelligent at the application layer. Not only will the application be more portable, it will also in the end be more efficient, since even with the most exotic database technologies, the most efficient transactional commit is the unneeded commit that you optimize away at the application layer.

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
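A sketch of this kind of application-level batching, with invented helper names standing in for the application's own storage code: mutations land in memory immediately, and the file is rewritten and fsync()ed at most once per interval, so the commit cost is amortized across many updates.

    /* Application-level batching: update in memory, flush at most once
     * every FLUSH_INTERVAL seconds.  update_in_memory() and
     * rewrite_file_and_fsync() are assumed application helpers (the
     * latter would write a temp file, fsync it, then rename it). */
    #include <stdbool.h>
    #include <time.h>

    #define FLUSH_INTERVAL 5

    void update_in_memory(const char *key, const char *val);  /* assumed */
    void rewrite_file_and_fsync(void);                        /* assumed */

    static bool   dirty;
    static time_t last_flush;

    void set_key(const char *key, const char *val)
    {
        update_in_memory(key, val);   /* cheap; no I/O here */
        dirty = true;
    }

    void maybe_flush(void)            /* called from the main loop */
    {
        time_t now = time(NULL);
        if (dirty && now - last_flush >= FLUSH_INTERVAL) {
            rewrite_file_and_fsync();
            dirty = false;
            last_flush = now;
        }
    }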
* Re: [sqlite] light weight write barriers
  2012-10-25 14:03 ` Theodore Ts'o
@ 2012-10-25 18:03 ` david
  2012-10-25 18:29 ` Theodore Ts'o
  0 siblings, 1 reply; 58+ messages in thread
From: david @ 2012-10-25 18:03 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Nico Williams, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:
> Or does rsyslog *really* need to issue an fsync after each log
> message? Or could it batch updates so that every N seconds it
> flushes writes to the disk?

In part this depends on how paranoid the admin is. By default rsyslog doesn't do fsyncs, but admins can configure it to do so and can configure the batch size.

However, what I'm talking about here is not normal message traffic; it's the case where the admin has decided that they don't want to use the normal in-memory queues, they want the queues to be on disk so that if the system crashes the queued data will still be there to be processed after the crash. (In addition, this can be used to cover cases where you want queue sizes larger than your available RAM.)

In this case the extreme, and only at the explicit direction of the admin, is to fsync after every message. The norm is that it's acceptable to lose the last few messages, but losing a chunk out of the middle of the queue file can cause a whole lot more to be lost, passing the threshold of acceptability.

> Sometimes the answer is not to try to create exotic database-like
> functionality in the file system --- the answer is to be more
> intelligent at the application layer. Not only will the application
> be more portable, it will also in the end be more efficient, since
> even with the most exotic database technologies, the most efficient
> transactional commit is the unneeded commit that you optimize away at
> the application layer.

I agree; this is why I'm trying to figure out the recommended way to do this without needing to do full commits.

Since in most cases it's acceptable to lose the last few chunks written, if we had some way of specifying ordering without having to specify "write this NOW", the solution would be pretty obvious.

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-25 18:03 ` david
@ 2012-10-25 18:29 ` Theodore Ts'o
  2012-11-05 20:03 ` Pavel Machek
  0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 18:29 UTC (permalink / raw)
To: david
Cc: Nico Williams, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Thu, Oct 25, 2012 at 11:03:13AM -0700, david@lang.hm wrote:
> I agree; this is why I'm trying to figure out the recommended way to
> do this without needing to do full commits.
>
> Since in most cases it's acceptable to lose the last few chunks
> written, if we had some way of specifying ordering without having to
> specify "write this NOW", the solution would be pretty obvious.

Well, using data journalling with ext3/4 may do what you want. If you don't do any fsync, the changes will get written every 5 seconds when the automatic journal sync happens (and sub-4k writes will also get coalesced to a 5-second granularity).

Even with plain text files, it's pretty easy to tell after a crash whether the final record was partially written or not; just look for a trailing newline.

Better yet, if you are writing to multiple log files with data journalling, all of the writes will happen at the same time, and they will be streamed to the file system journal, minimizing random writes for at least the journal writes.

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
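The trailing-newline recovery step might look like this as code: after a crash, scan back from the end of the newline-delimited log and truncate off any torn final record (the helper is invented for the example, not from any existing tool).

    /* Crash recovery for a newline-delimited log: truncate back to the
     * last complete ('\n'-terminated) record. */
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int trim_torn_tail(const char *path)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }

        off_t pos = st.st_size;
        char c;
        while (pos > 0) {             /* scan backwards for '\n' */
            if (pread(fd, &c, 1, pos - 1) != 1) { close(fd); return -1; }
            if (c == '\n')
                break;
            pos--;
        }

        int rc = (pos == st.st_size) ? 0 : ftruncate(fd, pos);
        close(fd);
        return rc;
    }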
* Re: [sqlite] light weight write barriers
  2012-10-25 18:29 ` Theodore Ts'o
@ 2012-11-05 20:03 ` Pavel Machek
  2012-11-05 22:04 ` Theodore Ts'o
  0 siblings, 1 reply; 58+ messages in thread
From: Pavel Machek @ 2012-11-05 20:03 UTC (permalink / raw)
To: Theodore Ts'o, david, Nico Williams, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Thu 2012-10-25 14:29:48, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 11:03:13AM -0700, david@lang.hm wrote:
>> I agree; this is why I'm trying to figure out the recommended way to
>> do this without needing to do full commits.
>>
>> Since in most cases it's acceptable to lose the last few chunks
>> written, if we had some way of specifying ordering without having to
>> specify "write this NOW", the solution would be pretty obvious.
>
> Well, using data journalling with ext3/4 may do what you want. If you
> don't do any fsync, the changes will get written every 5 seconds when
> the automatic journal sync happens (and sub-4k writes will also get

Hmm. But that would need setting the journalling mode per-file, no?

Like, make it journal data for all the databases, but keep normal mode for the rest of the system...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-11-05 20:03 ` Pavel Machek
@ 2012-11-05 22:04 ` Theodore Ts'o
[not found] ` <CALwJ=Mx-uEFLXK2wywekk=0dwrwVFb68wocnH9bjXJmHRsJx3w@mail.gmail.com>
  0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-11-05 22:04 UTC (permalink / raw)
To: Pavel Machek
Cc: david, Nico Williams, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Mon, Nov 05, 2012 at 09:03:48PM +0100, Pavel Machek wrote:
>> Well, using data journalling with ext3/4 may do what you want. If you
>> don't do any fsync, the changes will get written every 5 seconds when
>> the automatic journal sync happens (and sub-4k writes will also get
>
> Hmm. But that would need setting the journalling mode per-file, no?
>
> Like, make it journal data for all the databases, but keep normal mode
> for the rest of the system...

You can do that, using "chattr +j file.db". It's apparently not a well-known feature of ext3/4....

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
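For reference, the same thing can be done from C rather than by shelling out to chattr, by flipping the per-file data-journalling flag with the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls; as the follow-up below notes, this needs CAP_SYS_RESOURCE (or root).

    /* Programmatic "chattr +j": set the ext3/4 per-file
     * data-journalling flag on an open fd. */
    #include <linux/fs.h>     /* FS_IOC_GETFLAGS, FS_JOURNAL_DATA_FL */
    #include <sys/ioctl.h>

    int set_data_journalling(int fd)
    {
        int flags;

        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
            return -1;
        flags |= FS_JOURNAL_DATA_FL;
        return ioctl(fd, FS_IOC_SETFLAGS, &flags);
    }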
[parent not found: <CALwJ=Mx-uEFLXK2wywekk=0dwrwVFb68wocnH9bjXJmHRsJx3w@mail.gmail.com>]
* Re: [sqlite] light weight write barriers
[not found] ` <CALwJ=Mx-uEFLXK2wywekk=0dwrwVFb68wocnH9bjXJmHRsJx3w@mail.gmail.com>
@ 2012-11-05 23:00 ` Theodore Ts'o
  0 siblings, 0 replies; 58+ messages in thread
From: Theodore Ts'o @ 2012-11-05 23:00 UTC (permalink / raw)
To: Richard Hipp
Cc: General Discussion of SQLite Database, Pavel Machek, david, Nico Williams, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Mon, Nov 05, 2012 at 05:37:02PM -0500, Richard Hipp wrote:
>
> Per the docs: "Only the superuser or a process possessing the
> CAP_SYS_RESOURCE capability can set or clear this attribute." That
> prevents most applications that run SQLite from being able to take
> advantage of this, since most such applications lack elevated
> privileges.

If this feature would prove useful to sqlite, that's something we could address. I could imagine making this available to processes that belong to a specific group that would be specified in the superblock or as a mount option. (We already have something like that which allows a specific uid or gid to use the reserved space in the superblock.)

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-25 6:02 ` Theodore Ts'o
  2012-10-25 6:58 ` david
@ 2012-10-30 23:49 ` Nico Williams
  1 sibling, 0 replies; 58+ messages in thread
From: Nico Williams @ 2012-10-30 23:49 UTC (permalink / raw)
To: Theodore Ts'o, Nico Williams, david, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel

[Dropping sqlite-users. Note that I'm not subscribed to any of the other lists cc'ed.]

On Thu, Oct 25, 2012 at 1:02 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>>
>> By trusting fsync(). And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.

You are all missing some context which I would have added had I noticed the cc'ing of additional lists.

D.R. Hipp asked for a light-weight barrier API from the OS/filesystem, the SQLite use-case being to implement fast ACI_ semantics, without durability (i.e., it being OK to lose the last few transactions, but not to end up with a corrupt DB, while maintaining atomicity, consistency, and isolation).

I noted that with a journalled/COW DB file format[0] one could run an fsync() in a "background" thread to act as a barrier, and then note in each transaction the last preceding transaction known to have reached disk (because fsync() returned and the bg thread marked the transaction in question as durable). Then refrain from garbage collecting any transactions not marked as durable.

Now, there are some caveats, the main one being that this fails if the filesystem or hardware lies about fsync() / cache flushes. Other caveats include that fsync() used this way can have more impact on filesystem performance than a true light-weight barrier[1], that the filesystem itself might not be powerfail-safe, and maybe a few others. But the point is that fsync() can be used in such a way that one need not wait for a transaction to reach rotating rust stably, while still retaining powerfail safety without durability for the last few transactions.

[0] Like the 4.4BSD log-structured filesystem, ZFS, Howard Chu's MDB, and many others. Note that ZFS has a pool-import-time option to recover from power failures by ignoring any transactions that cannot be completely verified and rolling back to the last verifiable one.

[1] Think of what ZFS does when there's no ZIL and an fsync() comes along: ZFS will either block the fsync() thread until the current transaction closes, or else close the current transaction and possibly write a much smaller transaction, thus losing out on making writes as large and contiguous as possible.

> The challenge is when you have entangled metadata updates. That is,
> you update file A and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated. You can keep track
> of which bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, with all of the complexity this
> implies: http://lwn.net/Articles/339337/

I believe that my suggestion composes for multi-file DB file formats, as long as the sum total forms a COWish on-disk format. Of course, adding more fsync()s, even if run in bg threads, may impact system performance even more (see above).
Also, if one has a COWish DB then why use more than one file? If the answer were "to spread contents across devices" one might ask "why not trust the filesystem/volume manager to do that?", but hey.

I'm not actually proposing that people try to compose this ACI_ technique though...

Nico
--

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-24 22:03 ` david
  2012-10-25 0:20 ` Nico Williams
@ 2012-10-25 5:42 ` Theodore Ts'o
  2012-10-25 7:11 ` david
  1 sibling, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 5:42 UTC (permalink / raw)
To: david
Cc: Nico Williams, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Wed, Oct 24, 2012 at 03:03:00PM -0700, david@lang.hm wrote:
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is a
> need to be sure that what is there is complete up to the point where
> it's lost.
>
> This is similar in concept to the write-ahead logs done for databases
> (without the absolute durability requirement).

If that's what you require, and you are using ext3/4, using data journalling might meet your requirements. It's something you can enable on a per-file basis via chattr +j; you don't have to force all file systems to use data journaling via the data=journalled mount option.

The potential downsides that you may or may not care about for this particular application:

(a) This will definitely have a performance impact, especially if you are doing lots of small (less than 4k) writes, since the data blocks will get run through the journal and will only later get written to their final location on disk.

(b) You don't get atomicity if the write spans a 4k block boundary. All of the bytes before i_size will be written, so you don't have to worry about "holes"; but the last message written to the log file might be truncated.

(c) There will be a performance impact, since the contents of data blocks will be written at least twice (once to the journal, and once to the final location on disk). If you do lots of small, sub-4k writes, the performance might be even worse, since data blocks might be written multiple times to the journal.

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
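One way an application might sidestep caveat (b), sketched here under the assumption that individual records are smaller than 4k (the helper is invented for the example): pad with newlines whenever a record would straddle a 4KiB boundary, so every record lives within one block and the trailing-newline recovery trick keeps working.

    /* Keep each (sub-4k) record inside one 4KiB block by padding with
     * newlines before any record that would straddle a block boundary. */
    #include <string.h>
    #include <unistd.h>

    #define BLK 4096

    int write_record_aligned(int fd, off_t *end, const char *rec, size_t len)
    {
        off_t off = *end;

        if (off / BLK != (off + (off_t)len - 1) / BLK) {
            /* Record would cross a boundary: pad to the next block. */
            static char pad[BLK];
            if (pad[0] != '\n')
                memset(pad, '\n', sizeof pad);  /* idempotent init */
            size_t padlen = (size_t)(BLK - off % BLK);
            if (pwrite(fd, pad, padlen, off) != (ssize_t)padlen)
                return -1;
            off += (off_t)padlen;
        }

        if (pwrite(fd, rec, len, off) != (ssize_t)len)
            return -1;
        *end = off + (off_t)len;
        return 0;
    }

The wasted padding space is the price for never having a record torn across two blocks, which the thread suggests is acceptable for this workload.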
* Re: [sqlite] light weight write barriers
  2012-10-25 5:42 ` Theodore Ts'o
@ 2012-10-25 7:11 ` david
  0 siblings, 0 replies; 58+ messages in thread
From: david @ 2012-10-25 7:11 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Nico Williams, General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:
> On Wed, Oct 24, 2012 at 03:03:00PM -0700, david@lang.hm wrote:
>> Like what is being described for sqlite, losing the tail end of the
>> messages is not a big problem under normal conditions. But there is a
>> need to be sure that what is there is complete up to the point where
>> it's lost.
>>
>> This is similar in concept to the write-ahead logs done for databases
>> (without the absolute durability requirement).
>
> If that's what you require, and you are using ext3/4, using data
> journalling might meet your requirements. It's something you can
> enable on a per-file basis via chattr +j; you don't have to force all
> file systems to use data journaling via the data=journalled mount
> option.
>
> The potential downsides that you may or may not care about for this
> particular application:
>
> (a) This will definitely have a performance impact, especially if you
> are doing lots of small (less than 4k) writes, since the data blocks
> will get run through the journal and will only later get written to
> their final location on disk.
>
> (b) You don't get atomicity if the write spans a 4k block boundary.
> All of the bytes before i_size will be written, so you don't have to
> worry about "holes"; but the last message written to the log file
> might be truncated.
>
> (c) There will be a performance impact, since the contents of data
> blocks will be written at least twice (once to the journal, and once
> to the final location on disk). If you do lots of small, sub-4k
> writes, the performance might be even worse, since data blocks might
> be written multiple times to the journal.

I'll have to dig into this option. In the case of rsyslog it sounds like it could work (not as good as a filesystem-independent way of doing things, but better than full fsyncs).

Truncated messages are not great, but they are a detectable, and acceptable, risk.

While the average message size is much smaller than 4K (on my network it's ~250 bytes), the metadata that's broken out expands this somewhat, and we can afford to waste disk space if it makes things safer or more efficient.

If we do update-in-place with flags for each message, each message will need to be written up to three times (on receipt, when being processed, when finished processing). With high message burst rates, I'm worried that we would fill up the journal; is there a good way to deal with this?

I believe that ext4 can put the journal on a different device from the filesystem; would this help a lot?

If you were to put the journal for an ext4 filesystem on a RAM disk, you would lose the data-recovery protection of the journal, but could you use this trick to get ordered data writes onto the filesystem?

David Lang

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-24 21:17 ` Nico Williams
  2012-10-24 22:03 ` david
@ 2012-10-27 1:52 ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27 1:52 UTC (permalink / raw)
To: Nico Williams
Cc: General Discussion of SQLite Database, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

Nico Williams, on 10/24/2012 05:17 PM wrote:
>> Yes, SCSI has full support for ordered/simple commands designed
>> exactly for that task: [...]
>>
>> [...]
>>
>> But historically, for some reason, Linux storage developers were stuck
>> with the "barriers" concept, which is obviously not the same as ORDERED
>> commands, and hence had a lot of trouble with its ambiguous semantics.
>> As far as I can tell, the reason was a lack of sufficiently deep SCSI
>> understanding (how to handle errors, the belief that ACA is legacy from
>> parallel-SCSI times, etc.).
>
> Barriers are a very simple abstraction, so there's that.

It isn't simple at all. If you think for some time about barriers from the storage point of view, you will soon realize how bad and ambiguous they are.

>> Before that happens, people will keep returning again and again with
>> the same simple questions: why must the queue be flushed for any
>> ordered operation? Isn't it obvious overkill?
>
> That [cache flushing]

It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if you like. Often there's a big difference where it's done: on the system side, or on the storage side.

Actually, the performance improvements from NCQ are in many cases not because it allows the drive to reorder requests, as is commonly thought, but because it allows the drive's internal processing stages to stay busy without any idle time. Drives often have a long internal pipeline, hence the need to keep every stage of it always busy, and hence why using ORDERED commands is important for performance.

> is not what's being asked for here. Just a
> light-weight barrier. My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known uberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, and e) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed. This obtains ACI- transaction semantics (survives power
> failures, but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.

I believe what you really want is to be able to send to the storage a sequence of your favorite operations (FS operations, async IO operations, etc.) like:

Write-back caching disabled:

  data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write-back caching enabled:

  data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, ..., data op2M, ...

Right?

(ORDERED means it is guaranteed that this command will never, under any circumstances, be executed before any preceding command has completed, nor after any subsequent command has completed.)

Vlad

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-23 19:53 ` Vladislav Bolkhovitin
  2012-10-24 21:17 ` Nico Williams
@ 2012-10-25 5:14 ` Theodore Ts'o
  2012-10-25 13:03 ` Alan Cox
  2012-10-27 1:54 ` Vladislav Bolkhovitin
  1 sibling, 2 replies; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 5:14 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
> Yes, SCSI has full support for ordered/simple commands designed
> exactly for that task: to have a steady flow of commands even when
> some of them are ordered.....

SCSI does, yes --- *if* the device actually implements Tagged Command Queuing (TCQ). Not all devices do.

More importantly, SATA drives do *not* have this capability, and when you compare the price of SATA drives to uber-expensive "enterprise drives", it's not surprising that most people don't actually use SCSI/SAS drives that have implemented TCQ.

SATA's Native Command Queuing (NCQ) is not equivalent; it allows the drive to reorder requests (in particular read requests) so they can be serviced more efficiently, but it does *not* allow the OS to specify a partial, relative ordering of requests.

Yes, you can turn off writeback caching, but that has pretty huge performance costs; and there is the FUA bit, but that's just an unconditional high-priority bypass of the writeback cache, which is useful in some cases but which, again, does not give the OS the ability to specify a partial order while letting the drive reorder other requests for efficiency/performance's sake, since the drive has a lot more information about the optimal way to reorder requests based on the current location of the drive head and where certain blocks may have been remapped due to bad-block sparing, etc.

> Hopefully, eventually the storage developers will realize the value
> behind ordered commands and learn the corresponding SCSI facilities
> to deal with them.

Eventually, drive manufacturers will realize that trying to price-gouge people who want advanced features such as TCQ and DIF/DIX is the best way to guarantee that most people won't bother to purchase them, and hence the features will remain largely unused....

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers
  2012-10-25 5:14 ` Theodore Ts'o
@ 2012-10-25 13:03 ` Alan Cox
  2012-10-25 13:50 ` Theodore Ts'o
  1 sibling, 1 reply; 58+ messages in thread
From: Alan Cox @ 2012-10-25 13:03 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Vladislav Bolkhovitin, 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

>> Hopefully, eventually the storage developers will realize the value
>> behind ordered commands and learn the corresponding SCSI facilities
>> to deal with them.
>
> Eventually, drive manufacturers will realize that trying to price-gouge
> people who want advanced features such as TCQ and DIF/DIX is the best
> way to guarantee that most people won't bother to purchase them, and
> hence the features will remain largely unused....

I doubt they care. The profit on high-end features from the people who really need them, I would bet, far exceeds any other benefit of giving them to everyone else. Welcome to capitalism 8)

Plus, spinning rust for those end users is on the way out, SATA-to-flash is a bit of a hack, and people are already putting a lot of focus onto things like NVM Express.

Alan

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-10-25 13:03 ` Alan Cox @ 2012-10-25 13:50 ` Theodore Ts'o 2012-10-27 1:55 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-25 13:50 UTC (permalink / raw)
To: Alan Cox
Cc: Vladislav Bolkhovitin, 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

On Thu, Oct 25, 2012 at 02:03:25PM +0100, Alan Cox wrote:
> I doubt they care. I would bet the profit on high-end features from the people who really need them far exceeds any benefit from giving them to everyone else. Welcome to capitalism 8)

Yes, but it's a question of pricing. If they had priced it just a wee bit higher, then there would have been incentive to add support for TCQ to various Linux file systems so it could actually be used, since there would have been lots of users of it. But as it is, the folks who are purchasing vast numbers of these drives --- such as the large cloud providers: Amazon, Facebook, Rackspace, et al. --- will choose to purchase large numbers of commodity drives, and then find ways to work around the missing functionality in userspace.

For example, DIF/DIX would be nice, and if it were available for cheap, I could imagine it being used. But you can accomplish the same thing in userspace, and in fact at Google I've implemented a special not-for-mainline patch which spikes out stable writes (required for DIF/DIX), because it has significant performance overhead and DIF/DIX has zero benefit if you're not willing to shell out $$$ for hardware that supports it.

Maybe the HDD manufacturers have been able to price gouge a small number of enterprise I/T shops with more dollars than sense, but personally, I'm not convinced they picked an optimal pricing strategy....

Put another way, I accept that Toyota should price a Lexus ES higher than a Camry, but if it's priced at, say, 3x the price of a Camry instead of 20% more, they might find that precious few people are willing to pay that kind of money for what is essentially the same car with minor luxury tweaks added to it.

> Plus - spinning rust for those end users is on the way out, SATA to flash is a bit of a hack and people are already putting a lot of focus onto things like NVM Express.

Yeah.... I don't buy that. One, flash is still too expensive. Two, the capital costs to build enough silicon foundries to replace the current production volume of HDDs is way too expensive for any company to afford (the cloud providers are buying *huge* numbers of HDDs) --- and that's assuming companies wouldn't choose to use those foundries for products with larger margins --- such as, for example, CPU/GPU chips. :-) And third and finally, if you study the long-term trends in terms of Data Retention Time (going down), Program and Read Disturb (going up), and Write Endurance (going down) as a function of feature size and/or time, you'd be wise to treat flash as nothing more than a short-term cache, and not as a long-term stable store.

If end users completely give in to flash, and store all of their precious family pictures on flash storage, after a couple of years they are likely going to be very disappointed....

Speaking personally, I wouldn't want to have anything on flash for more than a few months at *most* before I made sure I had another copy saved on spinning rust platters for long-term retention.

- Ted

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-10-25 13:50 ` Theodore Ts'o @ 2012-10-27 1:55 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27 1:55 UTC (permalink / raw)
To: Theodore Ts'o, Alan Cox, 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

Theodore Ts'o, on 10/25/2012 09:50 AM wrote:
> Yeah.... I don't buy that. One, flash is still too expensive. Two, the capital costs to build enough silicon foundries to replace the current production volume of HDDs is way too expensive for any company to afford (the cloud providers are buying *huge* numbers of HDDs) --- and that's assuming companies wouldn't choose to use those foundries for products with larger margins --- such as, for example, CPU/GPU chips. :-) And third and finally, if you study the long-term trends in terms of Data Retention Time (going down), Program and Read Disturb (going up), and Write Endurance (going down) as a function of feature size and/or time, you'd be wise to treat flash as nothing more than a short-term cache, and not as a long-term stable store.
>
> If end users completely give in to flash, and store all of their precious family pictures on flash storage, after a couple of years they are likely going to be very disappointed....
>
> Speaking personally, I wouldn't want to have anything on flash for more than a few months at *most* before I made sure I had another copy saved on spinning rust platters for long-term retention.

Here I agree with you.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-10-25 5:14 ` Theodore Ts'o 2012-10-25 13:03 ` Alan Cox @ 2012-10-27 1:54 ` Vladislav Bolkhovitin 2012-10-27 4:44 ` Theodore Ts'o [not found] ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com> 1 sibling, 2 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27 1:54 UTC (permalink / raw)
To: Theodore Ts'o, 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

Theodore Ts'o, on 10/25/2012 01:14 AM wrote:
> On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
>> Yes, SCSI has full support for ordered/simple commands designed exactly for that task: to have steady flow of commands even in case when some of them are ordered.....
>
> SCSI does, yes --- *if* the device actually implements Tagged Command Queuing (TCQ). Not all devices do.
>
> More importantly, SATA drives do *not* have this capability, and when you compare the price of SATA drives to uber-expensive "enterprise drives", it's not surprising that most people don't actually use SCSI/SAS drives that have implemented TCQ.

What differs in our positions is that you are considering storage as something you can connect to your desktop, while in my view storage is something which stores data and serves it the best possible way, with the best performance. Hence, for you the least common denominator of all storage features is the most important, while for me getting the best possible out of the storage is the most important.

In my view storage should offload from the host system as much as possible: data movements, ordered-operations requirements, atomic operations, deduplication, snapshots, reliability measures (e.g. RAID), load balancing, etc.

It's the same as with 2D/3D video acceleration hardware. If you want the best performance from your system, you should offload from it as much as possible. In the case of video, to the video hardware; in the case of storage, to the storage. As with video, for storage, better offload means better performance. At hundreds of thousands of IOPS it's clearly visible.

Price doesn't matter here, because it's a completely different topic.

> SATA's Native Command Queuing (NCQ) is not equivalent; this allows the drive to reorder requests (in particular read requests) so they can be serviced more efficiently, but it does *not* allow the OS to specify a partial, relative ordering of requests.

And so? If SATA can't do it, does it mean that nobody else can do it either? I know plenty of non-SATA devices which can meet the ordering requirements you need.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-10-27 1:54 ` Vladislav Bolkhovitin @ 2012-10-27 4:44 ` Theodore Ts'o 2012-10-30 22:22 ` Vladislav Bolkhovitin [not found] ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com> 1 sibling, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2012-10-27 4:44 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
> What differs in our positions is that you are considering storage as something you can connect to your desktop, while in my view storage is something which stores data and serves it the best possible way, with the best performance.

I don't get paid to make Linux storage work well for gold-plated storage, and as far as I know, none of the purveyors of said gold-plated storage systems are currently employing Linux file system developers to make Linux file systems work well on said gold-plated hardware.

As for what I might do on my own time, for fun, I can't afford said gold-plated hardware, and personally I get a lot more satisfaction if I know there will be a large number of people who benefit from my work (it was really cool when I found out that millions and millions of Android devices were going to be using ext4 :-), as opposed to a very small number of people who have paid $$$ to storage vendors who don't feel it's worthwhile to pay core Linux file system developers to leverage their hardware. Earlier, you were bemoaning why Linux file system developers weren't paying attention to using said fancy SCSI features. Perhaps now you'll understand better why it's not happening?

> Price doesn't matter here, because it's a completely different topic.

It matters if you think I'm going to do it on my own time, out of my own budget. And if you think my employer is going to choose to use said hardware, price definitely matters. I consider engineering to be the art of making tradeoffs, and price is absolutely one of the things that we need to trade off against other goals.

It's rare that you get to design something where performance matters above all else. Maybe it's that way if you're paid by folks whose job it is to destabilize the world's financial markets by pushing the poles into the right half plane (i.e., high frequency trading :-). But for the rest of the world, price absolutely matters.

- Ted

P.S. All of the storage I have access to at home is SATA. If someone would like to change that and ship me free hardware, as long as it doesn't require three-phase power (or some exotic interconnect which is ghastly expensive and which you are also not going to provide me for free), do contact me off-line. :-)

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-10-27 4:44 ` Theodore Ts'o @ 2012-10-30 22:22 ` Vladislav Bolkhovitin 2012-10-31 9:54 ` Alan Cox 0 siblings, 1 reply; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-30 22:22 UTC (permalink / raw)
To: Theodore Ts'o, 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

Theodore Ts'o, on 10/27/2012 12:44 AM wrote:
> On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
>> What differs in our positions is that you are considering storage as something you can connect to your desktop, while in my view storage is something which stores data and serves it the best possible way, with the best performance.
>
> I don't get paid to make Linux storage work well for gold-plated storage, and as far as I know, none of the purveyors of said gold-plated storage systems are currently employing Linux file system developers to make Linux file systems work well on said gold-plated hardware.

I don't want to flame on this topic, but you are not right here. As far as I can see, a big chunk of Linux storage and file system developers are/were employed by the "gold-plated storage" manufacturers, starting with FusionIO, SGI and Oracle.

You know, Red Hat has also recently stepped into this market; at least I saw their advertisement at SDC 2012. So you can add all Red Hat employees here.

> As for what I might do on my own time, for fun, I can't afford said gold-plated hardware, and personally I get a lot more satisfaction if I know there will be a large number of people who benefit from my work (it was really cool when I found out that millions and millions of Android devices were going to be using ext4 :-), as opposed to a very small number of people who have paid $$$ to storage vendors who don't feel it's worthwhile to pay core Linux file system developers to leverage their hardware. Earlier, you were bemoaning why Linux file system developers weren't paying attention to using said fancy SCSI features. Perhaps now you'll understand better why it's not happening?
>
>> Price doesn't matter here, because it's a completely different topic.
>
> It matters if you think I'm going to do it on my own time, out of my own budget. And if you think my employer is going to choose to use said hardware, price definitely matters. I consider engineering to be the art of making tradeoffs, and price is absolutely one of the things that we need to trade off against other goals.
>
> It's rare that you get to design something where performance matters above all else. Maybe it's that way if you're paid by folks whose job it is to destabilize the world's financial markets by pushing the poles into the right half plane (i.e., high frequency trading :-). But for the rest of the world, price absolutely matters.

I fully understand your position. But "affordable" and "useful" are completely orthogonal things. The "high-end" features are very useful if you want to get high performance. The ones who can afford them will use them, which might include your favorite bank, for instance, so they will indirectly be working for you.

Of course, you don't have to work on those features, especially for free, but you similarly don't then have to call them useless only because they are not affordable enough to be put in a desktop [1].

Our discussion started not from "value-for-money", but from a constant demand to perform ordered commands without full queue draining, which has been ignored by the Linux storage developers for YEARS as not useful, right?

Vlad

[1] If you or somebody else wants to put something in a desktop that supports all the features necessary to perform ORDERED commands, including ACA, look at modern SAS SSDs. I wouldn't call the price of those devices "high-end".

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-10-30 22:22 ` Vladislav Bolkhovitin @ 2012-10-31 9:54 ` Alan Cox 2012-11-01 20:18 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 58+ messages in thread
From: Alan Cox @ 2012-10-31 9:54 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: Theodore Ts'o, 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

> I don't want to flame on this topic, but you are not right here. As far as I can see, a big chunk of Linux storage and file system developers are/were employed by the "gold-plated storage" manufacturers, starting with FusionIO, SGI and Oracle.
>
> You know, Red Hat has also recently stepped into this market; at least I saw their advertisement at SDC 2012. So you can add all Red Hat employees here.

Booleans generally should be reserved for logic operators. Most of the Linux companies work on both low- and high-end storage. The two are not mutually exclusive, nor do they divide neatly by market. Many big clouds use cheap low-end drives by the crate; some high-end desktops are using SAS, although given you can get six 2.5" hotplug drives in a 5.25" bay I'm personally not sure there is much point (and I used to have Fibre Channel on my Thinkpad 600 when docked 8))

> Our discussion started not from "value-for-money", but from a constant demand to perform ordered commands without full queue draining, which has been ignored by the Linux storage developers for YEARS as not useful, right?

Send patches with benchmarks demonstrating it is useful. It's really quite simple. Code talks.

Alan

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-10-31 9:54 ` Alan Cox @ 2012-11-01 20:18 ` Vladislav Bolkhovitin 2012-11-01 21:24 ` Alan Cox 0 siblings, 1 reply; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-01 20:18 UTC (permalink / raw)
To: Alan Cox
Cc: Theodore Ts'o, 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

Alan Cox, on 10/31/2012 05:54 AM wrote:
>> I don't want to flame on this topic, but you are not right here. As far as I can see, a big chunk of Linux storage and file system developers are/were employed by the "gold-plated storage" manufacturers, starting with FusionIO, SGI and Oracle.
>>
>> You know, Red Hat has also recently stepped into this market; at least I saw their advertisement at SDC 2012. So you can add all Red Hat employees here.
>
> Booleans generally should be reserved for logic operators. Most of the Linux companies work on both low- and high-end storage. The two are not mutually exclusive, nor do they divide neatly by market. Many big clouds use cheap low-end drives by the crate; some high-end desktops are using SAS, although given you can get six 2.5" hotplug drives in a 5.25" bay I'm personally not sure there is much point

That doesn't contradict the point that high-performance storage vendors are also funding Linux kernel storage development.

> Send patches with benchmarks demonstrating it is useful. It's really quite simple. Code talks.

How about the fact that, recently, preliminary infrastructure to send ORDERED commands instead of draining the queue was deleted from the kernel, because "there's no difference whether you drain the queue on the kernel or the storage side"?

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-01 20:18 ` Vladislav Bolkhovitin @ 2012-11-01 21:24 ` Alan Cox 2012-11-02 0:15 ` Vladislav Bolkhovitin 2012-11-02 0:38 ` Howard Chu 0 siblings, 2 replies; 58+ messages in thread
From: Alan Cox @ 2012-11-01 21:24 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: Theodore Ts'o, 杨苏立 Yang Su Li, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

> How about the fact that, recently, preliminary infrastructure to send ORDERED commands instead of draining the queue was deleted from the kernel, because "there's no difference whether you drain the queue on the kernel or the storage side"?

Send patches.

Alan

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-01 21:24 ` Alan Cox @ 2012-11-02 0:15 ` Vladislav Bolkhovitin 2012-11-02 0:38 ` Howard Chu 1 sibling, 0 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-02 0:15 UTC (permalink / raw)
To: Alan Cox
Cc: Theodore Ts'o, 杨苏立 Yang Su Li, linux-kernel, linux-fsdevel, drh

Alan Cox, on 11/01/2012 05:24 PM wrote:
>> How about the fact that, recently, preliminary infrastructure to send ORDERED commands instead of draining the queue was deleted from the kernel, because "there's no difference whether you drain the queue on the kernel or the storage side"?
>
> Send patches.

OK, then we have good progress!

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-01 21:24 ` Alan Cox 2012-11-02 0:15 ` Vladislav Bolkhovitin @ 2012-11-02 0:38 ` Howard Chu 2012-11-02 12:33 ` Alan Cox ` (2 more replies) 1 sibling, 3 replies; 58+ messages in thread
From: Howard Chu @ 2012-11-02 0:38 UTC (permalink / raw)
To: General Discussion of SQLite Database
Cc: Alan Cox, Vladislav Bolkhovitin, Theodore Ts'o, drh, linux-kernel, linux-fsdevel

Alan Cox wrote:
>> How about the fact that, recently, preliminary infrastructure to send ORDERED commands instead of draining the queue was deleted from the kernel, because "there's no difference whether you drain the queue on the kernel or the storage side"?
>
> Send patches.

Isn't any type of kernel-side ordering an exercise in futility, since
a) the kernel has no knowledge of the disk's actual geometry
b) most drives will internally re-order requests anyway
c) cheap drives won't support barriers

Even assuming the drives honored all your requests without lying, how would you really want this behavior exposed? From the userland perspective, there are very few apps that care. Probably only transactional databases, really.

As a DB author, I'm not sure I'd be keen on this as an open() or fcntl() option. Databases that really care would be on dedicated filesystems and/or devices, so per-file control would be tedious. You would most likely want to say "all writes to this string of devices should be order-preserving" and forget about it. With that guarantee, a careful writer can have perfectly intact data structures all the time, without ever slowing down for an fsync.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-02 0:38 ` Howard Chu @ 2012-11-02 12:33 ` Alan Cox 2012-11-13 3:41 ` Vladislav Bolkhovitin 2012-11-13 3:37 ` Vladislav Bolkhovitin [not found] ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com> 2 siblings, 1 reply; 58+ messages in thread From: Alan Cox @ 2012-11-02 12:33 UTC (permalink / raw) To: Howard Chu Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin, Theodore Ts'o, drh, linux-kernel, linux-fsdevel > Isn't any type of kernel-side ordering an exercise in futility, since > a) the kernel has no knowledge of the disk's actual geometry > b) most drives will internally re-order requests anyway They will but only as permitted by the commands queued, so you have some control depending upon the interface capabilities. > c) cheap drives won't support barriers Barriers are pretty much universal as you need them for power off ! > Even assuming the drives honored all your requests without lying, how would > you really want this behavior exposed? From the userland perspective, there > are very few apps that care. Probably only transactional databases, really. And file systems internally sometimes. A file system is after all a transactional database of sorts. Alan ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-02 12:33 ` Alan Cox @ 2012-11-13 3:41 ` Vladislav Bolkhovitin 2012-11-13 17:40 ` Alan Cox 0 siblings, 1 reply; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13 3:41 UTC (permalink / raw)
To: Alan Cox
Cc: Howard Chu, General Discussion of SQLite Database, Vladislav Bolkhovitin, Theodore Ts'o, drh, linux-kernel, linux-fsdevel

Alan Cox, on 11/02/2012 08:33 AM wrote:
>> b) most drives will internally re-order requests anyway
>
> They will but only as permitted by the commands queued, so you have some control depending upon the interface capabilities.
>
>> c) cheap drives won't support barriers
>
> Barriers are pretty much universal as you need them for power off !

I'm afraid no storage (drives, if you like this term more) at the moment supports barriers and, as far as I know the storage history, never has. Instead, what storage does support in this area is the following:

1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.

2. Command ordering facilities: command attributes (ORDERED, SIMPLE, etc.), ACA, etc.

3. Atomic commands, e.g. scattered writes, which allow writing data to several separate, non-adjacent blocks in an atomic manner, i.e. with a guarantee that either all blocks are written or none at all. This is relatively new functionality, natural for flash storage with its COW internals. Obviously, using such atomic write commands, an application or a file system doesn't need any journaling anymore. FusionIO reported that after they modified MySQL to use them, they saw a 50% performance increase.

Note that those 3 facilities are ORTHOGONAL, i.e. they can be used independently, including on the same request. That is the root cause of why the barrier concept is so evil: if you specify a barrier, how can you say what actual action you really want from the storage? A cache flush? An ordered write? Or both?

This is why the relatively recent removal of barriers from the Linux kernel (http://lwn.net/Articles/400541/) was a big step ahead. The next logical step should be to allow the ORDERED attribute on requests to be accelerated by ORDERED commands of the storage, if it supports them. If not, fall back to the existing queue draining.

Actually, I'm wondering why the barrier concept is so sticky in the Linux world. A simple Google search shows that only Linux uses this concept for storage. Two years have passed since barriers were removed from the kernel, but people still discuss them as if they were still here.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
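To make facility (1) above concrete: userspace can hand a SYNCHRONIZE CACHE(10) command to a SCSI device through the Linux SG_IO ioctl. A minimal sketch follows; the device path, timeout and error handling are illustrative assumptions, not part of the thread.

/* Issue SCSI SYNCHRONIZE CACHE(10) via SG_IO -- minimal sketch. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <scsi/sg.h>

int main(void)
{
	unsigned char cdb[10] = { 0x35 };  /* SYNCHRONIZE CACHE(10), whole device */
	unsigned char sense[32];
	struct sg_io_hdr io;
	int fd = open("/dev/sdX", O_RDWR); /* placeholder device */

	if (fd < 0) { perror("open"); return 1; }

	memset(&io, 0, sizeof(io));
	io.interface_id = 'S';
	io.cmd_len = sizeof(cdb);
	io.cmdp = cdb;
	io.dxfer_direction = SG_DXFER_NONE; /* command has no data phase */
	io.mx_sb_len = sizeof(sense);
	io.sbp = sense;
	io.timeout = 20000;                 /* milliseconds */

	if (ioctl(fd, SG_IO, &io) < 0 || (io.info & SG_INFO_OK_MASK) != SG_INFO_OK)
		fprintf(stderr, "SYNCHRONIZE CACHE failed\n");

	close(fd);
	return 0;
}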
* Re: [sqlite] light weight write barriers 2012-11-13 3:41 ` Vladislav Bolkhovitin @ 2012-11-13 17:40 ` Alan Cox 2012-11-13 19:13 ` Nico Williams 2012-11-15 1:16 ` Vladislav Bolkhovitin 0 siblings, 2 replies; 58+ messages in thread
From: Alan Cox @ 2012-11-13 17:40 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: Howard Chu, General Discussion of SQLite Database, Theodore Ts'o, drh, linux-kernel, linux-fsdevel

> > Barriers are pretty much universal as you need them for power off !
>
> I'm afraid no storage (drives, if you like this term more) at the moment supports barriers and, as far as I know the storage history, never has.

The ATA cache flush is a write barrier, and given you have no NV cache visible to the controller it's the same thing.

> Instead, what storage does support in this area is the following:

Yes - the devil is in the detail once you go beyond simple capabilities.

Alan

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-13 17:40 ` Alan Cox @ 2012-11-13 19:13 ` Nico Williams 2012-11-15 1:17 ` Vladislav Bolkhovitin 2012-11-15 1:16 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 58+ messages in thread
From: Nico Williams @ 2012-11-13 19:13 UTC (permalink / raw)
To: General Discussion of SQLite Database
Cc: Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

On Tue, Nov 13, 2012 at 11:40 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> > Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid no storage (drives, if you like this term more) at the moment supports barriers and, as far as I know the storage history, never has.
>
> The ATA cache flush is a write barrier, and given you have no NV cache visible to the controller it's the same thing.
>
>> Instead, what storage does support in this area is the following:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.

Right: barriers are trivial to program with. Ordered writes less so. One could declare all writes to be ordered with respect to each other, but this will almost certainly hurt performance (at least with disks, though probably not SSDs), as opposed to barriers, which order one group of internally-unordered writes relative to another. And declaring groups of internally-unordered writes where the groups are ordered with respect to each other... is practically the same as barriers.

There's a lot to be said for simplicity... as long as the system is not so simple as to not work at all.

My p.o.v. is that a filesystem write barrier is effectively the same as fsync() with the ability to return sooner (before writes hit stable storage) when the filesystem and hardware support on-disk layouts and primitives which can be used to order writes preceding and succeeding the barrier.

Nico
--

^ permalink raw reply [flat|nested] 58+ messages in thread
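What Nico describes might look like the sketch below. fbarrier() is purely hypothetical (no such Linux syscall exists); it is stubbed here with fdatasync(), the only portable stand-in, which is stronger than a barrier needs to be because it also waits for durability.

#include <unistd.h>

/* Hypothetical: a real fbarrier() would only order writes around it;
 * fdatasync() additionally blocks until the earlier writes are durable. */
static int fbarrier(int fd)
{
	return fdatasync(fd);
}

int ordered_append(int fd, const void *a, size_t alen,
                   const void *b, size_t blen)
{
	if (write(fd, a, alen) != (ssize_t)alen) return -1;
	if (fbarrier(fd) < 0) return -1;  /* a before b, ideally without waiting */
	if (write(fd, b, blen) != (ssize_t)blen) return -1;
	return 0;
}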
* Re: [sqlite] light weight write barriers 2012-11-13 19:13 ` Nico Williams @ 2012-11-15 1:17 ` Vladislav Bolkhovitin 2012-11-15 12:07 ` David Lang 2012-11-15 17:06 ` Ryan Johnson 0 siblings, 2 replies; 58+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-15 1:17 UTC (permalink / raw)
To: Nico Williams
Cc: General Discussion of SQLite Database, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

Nico Williams, on 11/13/2012 02:13 PM wrote:
> declaring groups of internally-unordered writes where the groups are ordered with respect to each other... is practically the same as barriers.

Which barriers? Barriers meaning cache flush, or barriers meaning command order, or barriers meaning both?

There's no such thing as a "barrier". It is a fully artificial abstraction. After all, at the bottom of your stack you will have to translate it either to a cache flush, or to command order enforcement, or to both. Are you going to invent 3 types of barriers?

> There's a lot to be said for simplicity... as long as the system is not so simple as to not work at all.
>
> My p.o.v. is that a filesystem write barrier is effectively the same as fsync() with the ability to return sooner (before writes hit stable storage) when the filesystem and hardware support on-disk layouts and primitives which can be used to order writes preceding and succeeding the barrier.

Your mistake is that you are considering barriers as something real, which can do something real for you, while it is just an artificial abstraction, apparently invented by people with limited knowledge of how storage works and hence only a foggy vision of how barriers are supposed to be processed by it. A simple wrong answer.

Generally, you can invent any abstraction convenient for you, but the farther your abstractions are from the reality of your hardware, the less you will get from it, and with greater effort.

There are no barriers in Linux, and there are not going to be. Accept it. And start thinking instead about the offload capabilities your storage can offer to you.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-15 1:17 ` Vladislav Bolkhovitin @ 2012-11-15 12:07 ` David Lang 2012-11-16 15:06 ` Howard Chu ` (2 more replies) 0 siblings, 3 replies; 58+ messages in thread
From: David Lang @ 2012-11-15 12:07 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: Nico Williams, General Discussion of SQLite Database, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:
> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are ordered with respect to each other... is practically the same as barriers.
>
> Which barriers? Barriers meaning cache flush, or barriers meaning command order, or barriers meaning both?
>
> There's no such thing as a "barrier". It is a fully artificial abstraction. After all, at the bottom of your stack you will have to translate it either to a cache flush, or to command order enforcement, or to both.

When people talk about barriers, they are talking about order enforcement.

> Your mistake is that you are considering barriers as something real, which can do something real for you, while it is just an artificial abstraction, apparently invented by people with limited knowledge of how storage works and hence only a foggy vision of how barriers are supposed to be processed by it. A simple wrong answer.
>
> Generally, you can invent any abstraction convenient for you, but the farther your abstractions are from the reality of your hardware, the less you will get from it, and with greater effort.
>
> There are no barriers in Linux, and there are not going to be. Accept it. And start thinking instead about the offload capabilities your storage can offer to you.

the hardware capabilities are not directly accessible from userspace (and they probably shouldn't be)

barriers keep getting mentioned because they are an easy concept to understand. "do this set of stuff before doing any of this other set of stuff, but I don't care when any of this gets done" and they fit well with the requirements of the users.

Users readily accept that if the system crashes, they will lose the most recent stuff that they did, but they get annoyed when things get corrupted to the point that they lose the entire file.

this includes things like modifying one option and a crash resulting in the config file being blank. Yes, you can do the 'write to temp file, sync file, sync directory, rename file' dance, but the fact that to do so the user must sit and wait for the syncs to take place can be a problem. It would be far better to be able to say 'write to temp file, and after it's on disk, rename the file' and not have the user wait. The user doesn't really care if the changes hit disk immediately, or several seconds (or even 10s of seconds) later, as long as there is not any possibility of the rename hitting disk before the file contents.

The fact that this could be implemented in multiple ways in the existing hardware does not mean that there need to be multiple ways exposed to userspace, it just means that the cost of doing the operation will vary depending on the hardware that you have. This also means that if new hardware introduces a new way of implementing this, that improvement can be passed on to the users without needing application changes.

David Lang

^ permalink raw reply [flat|nested] 58+ messages in thread
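For reference, the dance David describes looks like this in C. The file names are illustrative, and this is the slow-but-safe version whose fsync() waits are exactly what he wants to avoid.

#define _GNU_SOURCE		/* for O_DIRECTORY on older systems */
#include <fcntl.h>
#include <unistd.h>

int save_file_atomically(const char *data, size_t len)
{
	int dirfd, fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) return -1;
	if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
	if (fsync(fd) < 0) { close(fd); return -1; }  /* the user waits here... */
	close(fd);

	/* old and new contents are both intact until this atomic switch */
	if (rename("config.tmp", "config") < 0) return -1;

	dirfd = open(".", O_RDONLY | O_DIRECTORY);
	if (dirfd < 0) return -1;
	if (fsync(dirfd) < 0) { close(dirfd); return -1; } /* ...and here, for the rename */
	close(dirfd);
	return 0;
}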
* Re: [sqlite] light weight write barriers 2012-11-15 12:07 ` David Lang @ 2012-11-16 15:06 ` Howard Chu 2012-11-16 15:31 ` Ric Wheeler 2012-11-16 19:14 ` David Lang 1 sibling, 2 replies; 58+ messages in thread
From: Howard Chu @ 2012-11-16 15:06 UTC (permalink / raw)
To: General Discussion of SQLite Database
Cc: David Lang, Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

David Lang wrote:
> barriers keep getting mentioned because they are an easy concept to understand. "do this set of stuff before doing any of this other set of stuff, but I don't care when any of this gets done" and they fit well with the requirements of the users.
>
> Users readily accept that if the system crashes, they will lose the most recent stuff that they did,

*some* users may accept that. *None* should.

> but they get annoyed when things get corrupted to the point that they lose the entire file.
>
> this includes things like modifying one option and a crash resulting in the config file being blank. Yes, you can do the 'write to temp file, sync file, sync directory, rename file' dance, but the fact that to do so the user must sit and wait for the syncs to take place can be a problem. It would be far better to be able to say 'write to temp file, and after it's on disk, rename the file' and not have the user wait. The user doesn't really care if the changes hit disk immediately, or several seconds (or even 10s of seconds) later, as long as there is not any possibility of the rename hitting disk before the file contents.
>
> The fact that this could be implemented in multiple ways in the existing hardware does not mean that there need to be multiple ways exposed to userspace, it just means that the cost of doing the operation will vary depending on the hardware that you have. This also means that if new hardware introduces a new way of implementing this, that improvement can be passed on to the users without needing application changes.

There are a couple of industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it because they don't know better. We programmers, who know better, have failed to raise a stink and demand that this be fixed.

A) Drives should not lose data on power failure. If a drive accepts a write request and says "OK, done" then that data should get written to stable storage, period. Whether it requires capacitors or some other onboard power supply, or whatever, they should just do it. Keep in mind that today, most of the difference between enterprise drives and consumer desktop drives is just a firmware change; the hardware is already identical. Nobody should accept a product that doesn't offer this guarantee. It's inexcusable.

B) it should go without saying - drives should reliably report back to the host when something goes wrong. E.g., if a write request has been accepted, cached, and reported complete, but then during the actual write an ECC failure is detected in the cacheline, the drive needs to tell the host "oh by the way, block XXX didn't actually make it to disk like I told you it did 10ms ago."

If the entire software industry were to simply state "your shit stinks and we're not going to take it any more" the hard drive industry would have no choice but to fix it. And in most cases it would be a zero-cost fix for them.

Once you have drives that are actually trustworthy, actually reliable (which doesn't mean they never fail, it only means they tell the truth about successes or failures), most of these other issues disappear. Most of the need for barriers disappears.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-16 15:06 ` Howard Chu @ 2012-11-16 15:31 ` Ric Wheeler 2012-11-16 15:54 ` Howard Chu 2012-11-16 19:14 ` David Lang 1 sibling, 1 reply; 58+ messages in thread
From: Ric Wheeler @ 2012-11-16 15:31 UTC (permalink / raw)
To: Howard Chu
Cc: General Discussion of SQLite Database, David Lang, Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

On 11/16/2012 10:06 AM, Howard Chu wrote:
> David Lang wrote:
>> barriers keep getting mentioned because they are an easy concept to understand. "do this set of stuff before doing any of this other set of stuff, but I don't care when any of this gets done" and they fit well with the requirements of the users.
>>
>> Users readily accept that if the system crashes, they will lose the most recent stuff that they did,
>
> *some* users may accept that. *None* should.
>
>> but they get annoyed when things get corrupted to the point that they lose the entire file.
>>
>> this includes things like modifying one option and a crash resulting in the config file being blank. Yes, you can do the 'write to temp file, sync file, sync directory, rename file' dance, but the fact that to do so the user must sit and wait for the syncs to take place can be a problem. It would be far better to be able to say 'write to temp file, and after it's on disk, rename the file' and not have the user wait. The user doesn't really care if the changes hit disk immediately, or several seconds (or even 10s of seconds) later, as long as there is not any possibility of the rename hitting disk before the file contents.
>>
>> The fact that this could be implemented in multiple ways in the existing hardware does not mean that there need to be multiple ways exposed to userspace, it just means that the cost of doing the operation will vary depending on the hardware that you have. This also means that if new hardware introduces a new way of implementing this, that improvement can be passed on to the users without needing application changes.
>
> There are a couple of industry failures here:
>
> 1) the drive manufacturers sell drives that lie, and consumers accept it because they don't know better. We programmers, who know better, have failed to raise a stink and demand that this be fixed.
>
> A) Drives should not lose data on power failure. If a drive accepts a write request and says "OK, done" then that data should get written to stable storage, period. Whether it requires capacitors or some other onboard power supply, or whatever, they should just do it. Keep in mind that today, most of the difference between enterprise drives and consumer desktop drives is just a firmware change; the hardware is already identical. Nobody should accept a product that doesn't offer this guarantee. It's inexcusable.
>
> B) it should go without saying - drives should reliably report back to the host when something goes wrong. E.g., if a write request has been accepted, cached, and reported complete, but then during the actual write an ECC failure is detected in the cacheline, the drive needs to tell the host "oh by the way, block XXX didn't actually make it to disk like I told you it did 10ms ago."
>
> If the entire software industry were to simply state "your shit stinks and we're not going to take it any more" the hard drive industry would have no choice but to fix it. And in most cases it would be a zero-cost fix for them.
>
> Once you have drives that are actually trustworthy, actually reliable (which doesn't mean they never fail, it only means they tell the truth about successes or failures), most of these other issues disappear. Most of the need for barriers disappears.

I think that you are arguing a fairly silly point.

If you want that behaviour, you have had it for more than a decade - simply disable the write cache on your drive and you are done.

If you - as a user - want to run faster and use applications that are coded to handle data integrity properly (fsync, fdatasync, etc), leave the write cache enabled and use file system barriers.

Everyone has to trade off cost versus something else and this is a very, very long-standing trade-off that drive manufacturers have made. The more money you pay for your storage, the less likely this is to be an issue (high-end SSDs, enterprise-class arrays, etc. don't have volatile write caches, and most SAS drives perform reasonably well with the write cache disabled).

Regards,

Ric

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-16 15:31 ` Ric Wheeler @ 2012-11-16 15:54 ` Howard Chu 2012-11-16 18:03 ` Ric Wheeler 0 siblings, 1 reply; 58+ messages in thread
From: Howard Chu @ 2012-11-16 15:54 UTC (permalink / raw)
To: Ric Wheeler
Cc: General Discussion of SQLite Database, David Lang, Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

Ric Wheeler wrote:
> On 11/16/2012 10:06 AM, Howard Chu wrote:
>> David Lang wrote:
>>> barriers keep getting mentioned because they are an easy concept to understand. "do this set of stuff before doing any of this other set of stuff, but I don't care when any of this gets done" and they fit well with the requirements of the users.
>>>
>>> Users readily accept that if the system crashes, they will lose the most recent stuff that they did,
>>
>> *some* users may accept that. *None* should.
>>
>>> but they get annoyed when things get corrupted to the point that they lose the entire file.
>>>
>>> this includes things like modifying one option and a crash resulting in the config file being blank. Yes, you can do the 'write to temp file, sync file, sync directory, rename file' dance, but the fact that to do so the user must sit and wait for the syncs to take place can be a problem. It would be far better to be able to say 'write to temp file, and after it's on disk, rename the file' and not have the user wait. The user doesn't really care if the changes hit disk immediately, or several seconds (or even 10s of seconds) later, as long as there is not any possibility of the rename hitting disk before the file contents.
>>>
>>> The fact that this could be implemented in multiple ways in the existing hardware does not mean that there need to be multiple ways exposed to userspace, it just means that the cost of doing the operation will vary depending on the hardware that you have. This also means that if new hardware introduces a new way of implementing this, that improvement can be passed on to the users without needing application changes.
>>
>> There are a couple of industry failures here:
>>
>> 1) the drive manufacturers sell drives that lie, and consumers accept it because they don't know better. We programmers, who know better, have failed to raise a stink and demand that this be fixed.
>>
>> A) Drives should not lose data on power failure. If a drive accepts a write request and says "OK, done" then that data should get written to stable storage, period. Whether it requires capacitors or some other onboard power supply, or whatever, they should just do it. Keep in mind that today, most of the difference between enterprise drives and consumer desktop drives is just a firmware change; the hardware is already identical. Nobody should accept a product that doesn't offer this guarantee. It's inexcusable.
>>
>> B) it should go without saying - drives should reliably report back to the host when something goes wrong. E.g., if a write request has been accepted, cached, and reported complete, but then during the actual write an ECC failure is detected in the cacheline, the drive needs to tell the host "oh by the way, block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>
>> If the entire software industry were to simply state "your shit stinks and we're not going to take it any more" the hard drive industry would have no choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>
>> Once you have drives that are actually trustworthy, actually reliable (which doesn't mean they never fail, it only means they tell the truth about successes or failures), most of these other issues disappear. Most of the need for barriers disappears.
>
> I think that you are arguing a fairly silly point.

Seems to me that you're arguing that we should accept inferior technology. Who's really being silly?

> If you want that behaviour, you have had it for more than a decade - simply disable the write cache on your drive and you are done.

You seem to believe it's nonsensical for someone to want both fast and reliable writes, or that it's unreasonable for a storage device to offer the same, cheaply. And yet it is clearly trivial to provide all of the above.

> If you - as a user - want to run faster and use applications that are coded to handle data integrity properly (fsync, fdatasync, etc), leave the write cache enabled and use file system barriers.

Applications aren't supposed to need to worry about such details, that's why we have operating systems.

Drives should tell the truth. In event of an error detected after the fact, the drive should report the error back to the host. There's nothing nonsensical there.

When a drive's cache is enabled, the host should maintain a queue of written pages, of a length equal to the size of the drive's cache. If a drive says "hey, block XXX failed" the OS can reissue the write from its own queue. No muss, no fuss, no performance bottlenecks. This is what Real Computers did before the age of VAX Unix.

> Everyone has to trade off cost versus something else and this is a very, very long-standing trade-off that drive manufacturers have made.

With the cost of storage falling as rapidly as it has in recent years, this is a stupid tradeoff.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

^ permalink raw reply [flat|nested] 58+ messages in thread
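Howard's reissue queue could be sketched as below. This is hypothetical host-side bookkeeping: no current drive protocol delivers the per-block commit/error callbacks it assumes, and reissue_write() is an invented helper.

#include <stdint.h>
#include <string.h>

#define BLK_SZ     4096
#define CACHE_BLKS 2048		/* assumption: sized to the drive's cache */

struct inflight {
	uint64_t lba;
	int      valid;
	uint8_t  data[BLK_SZ];
};

static struct inflight ring[CACHE_BLKS];

static void reissue_write(uint64_t lba, const void *buf)
{
	/* resubmit the block to the device; elided in this sketch */
	(void)lba; (void)buf;
}

void track_write(uint64_t lba, const void *buf)	/* on every write issued */
{
	struct inflight *s = &ring[lba % CACHE_BLKS];
	s->lba = lba;
	s->valid = 1;
	memcpy(s->data, buf, BLK_SZ);
}

void on_drive_commit(uint64_t lba)	/* drive: block is on stable media */
{
	struct inflight *s = &ring[lba % CACHE_BLKS];
	if (s->valid && s->lba == lba)
		s->valid = 0;
}

void on_drive_error(uint64_t lba)	/* drive: "block XXX didn't make it" */
{
	struct inflight *s = &ring[lba % CACHE_BLKS];
	if (s->valid && s->lba == lba)
		reissue_write(s->lba, s->data);
}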
* Re: [sqlite] light weight write barriers 2012-11-16 15:54 ` Howard Chu @ 2012-11-16 18:03 ` Ric Wheeler 0 siblings, 0 replies; 58+ messages in thread
From: Ric Wheeler @ 2012-11-16 18:03 UTC (permalink / raw)
To: Howard Chu
Cc: General Discussion of SQLite Database, David Lang, Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

On 11/16/2012 10:54 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 11/16/2012 10:06 AM, Howard Chu wrote:
>>> David Lang wrote:
>>>> barriers keep getting mentioned because they are an easy concept to understand. "do this set of stuff before doing any of this other set of stuff, but I don't care when any of this gets done" and they fit well with the requirements of the users.
>>>>
>>>> Users readily accept that if the system crashes, they will lose the most recent stuff that they did,
>>>
>>> *some* users may accept that. *None* should.
>>>
>>>> but they get annoyed when things get corrupted to the point that they lose the entire file.
>>>>
>>>> this includes things like modifying one option and a crash resulting in the config file being blank. Yes, you can do the 'write to temp file, sync file, sync directory, rename file' dance, but the fact that to do so the user must sit and wait for the syncs to take place can be a problem. It would be far better to be able to say 'write to temp file, and after it's on disk, rename the file' and not have the user wait. The user doesn't really care if the changes hit disk immediately, or several seconds (or even 10s of seconds) later, as long as there is not any possibility of the rename hitting disk before the file contents.
>>>>
>>>> The fact that this could be implemented in multiple ways in the existing hardware does not mean that there need to be multiple ways exposed to userspace, it just means that the cost of doing the operation will vary depending on the hardware that you have. This also means that if new hardware introduces a new way of implementing this, that improvement can be passed on to the users without needing application changes.
>>>
>>> There are a couple of industry failures here:
>>>
>>> 1) the drive manufacturers sell drives that lie, and consumers accept it because they don't know better. We programmers, who know better, have failed to raise a stink and demand that this be fixed.
>>>
>>> A) Drives should not lose data on power failure. If a drive accepts a write request and says "OK, done" then that data should get written to stable storage, period. Whether it requires capacitors or some other onboard power supply, or whatever, they should just do it. Keep in mind that today, most of the difference between enterprise drives and consumer desktop drives is just a firmware change; the hardware is already identical. Nobody should accept a product that doesn't offer this guarantee. It's inexcusable.
>>>
>>> B) it should go without saying - drives should reliably report back to the host when something goes wrong. E.g., if a write request has been accepted, cached, and reported complete, but then during the actual write an ECC failure is detected in the cacheline, the drive needs to tell the host "oh by the way, block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>>
>>> If the entire software industry were to simply state "your shit stinks and we're not going to take it any more" the hard drive industry would have no choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>>
>>> Once you have drives that are actually trustworthy, actually reliable (which doesn't mean they never fail, it only means they tell the truth about successes or failures), most of these other issues disappear. Most of the need for barriers disappears.
>>
>> I think that you are arguing a fairly silly point.
>
> Seems to me that you're arguing that we should accept inferior technology. Who's really being silly?

No, just suggesting that you either pay for the expensive stuff or learn how to use cost-effective, high-capacity storage like the rest of the world. I don't disagree that having non-volatile write caches would be nice, but everyone has learned how to deal with volatile write caches at the low end of the market.

>> If you want that behaviour, you have had it for more than a decade - simply disable the write cache on your drive and you are done.
>
> You seem to believe it's nonsensical for someone to want both fast and reliable writes, or that it's unreasonable for a storage device to offer the same, cheaply. And yet it is clearly trivial to provide all of the above.

I look forward to seeing your products in the market. Until you have more than "I want" and "I think" on your storage system design resume, I suggest you spend the money to get the parts with non-volatile write caches or fix your code.

Ric

>> If you - as a user - want to run faster and use applications that are coded to handle data integrity properly (fsync, fdatasync, etc), leave the write cache enabled and use file system barriers.
>
> Applications aren't supposed to need to worry about such details, that's why we have operating systems.
>
> Drives should tell the truth. In event of an error detected after the fact, the drive should report the error back to the host. There's nothing nonsensical there.
>
> When a drive's cache is enabled, the host should maintain a queue of written pages, of a length equal to the size of the drive's cache. If a drive says "hey, block XXX failed" the OS can reissue the write from its own queue. No muss, no fuss, no performance bottlenecks. This is what Real Computers did before the age of VAX Unix.
>
>> Everyone has to trade off cost versus something else and this is a very, very long-standing trade-off that drive manufacturers have made.
>
> With the cost of storage falling as rapidly as it has in recent years, this is a stupid tradeoff.

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-16 15:06 ` Howard Chu 2012-11-16 15:31 ` Ric Wheeler @ 2012-11-16 19:14 ` David Lang 1 sibling, 0 replies; 58+ messages in thread From: David Lang @ 2012-11-16 19:14 UTC (permalink / raw) To: Howard Chu Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel On Fri, 16 Nov 2012, Howard Chu wrote: > David Lang wrote: >> barriers keep getting mentioned because they are a easy concept to >> understand. >> "do this set of stuff before doing any of this other set of stuff, but I >> don't >> care when any of this gets done" and they fit well with the requirements of >> the >> users. >> >> Users readily accept that if the system crashes, they will loose the most >> recent >> stuff that they did, > > *some* users may accept that. *None* should. when users are given a choice of having all their work be very slow, or have it be fast, but in the unlikely event of a crash they loose their mose recent changes, they are willing to loose their most recent changes. If you think about it, this is not much different from the fact that you loose all changes since the last time you saved the thing you are working on. Many programs save state periodically so that if the application crashes the user hasn't lost everything, but any application that tried to save after every single change would be so slow that nobody would use it. There is always going to be a window after a user hits 'save' where the data can be lost, because it's not yet on disk. > There are a couple industry failures here: > > 1) the drive manufacturers sell drives that lie, and consumers accept it > because they don't know better. We programmers, who know better, have failed > to raise a stink and demand that this be fixed. > A) Drives should not lose data on power failure. If a drive accepts a write > request and says "OK, done" then that data should get written to stable > storage, period. Whether it requires capacitors or some other onboard power > supply, or whatever, they should just do it. Keep in mind that today, most of > the difference between enterprise drives and consumer desktop drives is just > a firmware change, that hardware is already identical. Nobody should accept a > product that doesn't offer this guarantee. It's inexcusable. This is an option to you. However if you have enabled write caching and reordering, you have explicitly told the system to be faster at the expense of loosing data under some conditions. The fact that you then loose data under those conditions should not surprise you. The idea that you must have enough power to write all the pending data to disk is problematic as that then severely limits the amount of cache that you have. > B) it should go without saying - drives should reliably report back to the > host, when something goes wrong. E.g., if a write request has been accepted, > cached, and reported complete, but then during the actual write an ECC > failure is detected in the cacheline, the drive needs to tell the host "oh by > the way, block XXX didn't actually make it to disk like I told you it did > 10ms ago." The issue isn't a drive having a write error, it's the system shutting down (or crashing) before the data is written, no OS level tricks will help you here. The real problem here isn't the drive claiming the data has been written when it hasn't, the real problem is that the application has said 'write this data' to the OS, and the OS has not done so yet. 
The OS delays the writes for many legitimate reasons (the disk may be busy, it can get things done more efficiently by combining and reordering the writes, etc). Unless the system crashes this is not a problem: the data will eventually be written out, and on system shutdown everything is good. But if the system crashes, some of this postponed work doesn't get done, and that can be a problem.

Applications can do fsync if they want to be sure that their data is safe on disk NOW, but they currently have no way of saying "I want to make sure that A happens before B, but I don't care if A happens now or 10 seconds from now".

That is the gap that it would be useful to provide a mechanism for, and it doesn't matter whether your disk system lies or not; there is still no way to deal with this today.

David Lang

^ permalink raw reply [flat|nested] 58+ messages in thread
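To make the gap David describes concrete: the only portable way today for an application to guarantee that write A reaches disk before write B is a full fsync() between the two, paying for durability when all that was wanted was ordering. A minimal C sketch (the file path and record contents are invented for illustration):

#include <fcntl.h>
#include <unistd.h>

int ordered_writes(const char *path)
{
    /* Hypothetical journal file; "A" must be on disk before "B". */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    const char a[] = "A: journal record";
    const char b[] = "B: commit record";

    /* Write A, then force it (and everything else pending on this
     * file) to stable storage. This is the expensive part: we block
     * until the device reports completion, although all we actually
     * need is order. */
    if (write(fd, a, sizeof a - 1) < 0 || fsync(fd) < 0) {
        close(fd);
        return -1;
    }

    /* Only now is it safe to issue B: a crash may still lose B, but
     * can no longer leave B on disk without A. */
    if (write(fd, b, sizeof b - 1) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}

The fsync() here is pure overhead relative to the hypothetical "A before B" primitive being asked for in this thread.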
* Re: [sqlite] light weight write barriers 2012-11-15 12:07 ` David Lang 2012-11-16 15:06 ` Howard Chu @ 2012-11-17 5:02 ` Vladislav Bolkhovitin [not found] ` <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com> 2 siblings, 0 replies; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-17 5:02 UTC (permalink / raw) To: David Lang Cc: Nico Williams, General Discussion of SQLite Database, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

David Lang, on 11/15/2012 07:07 AM wrote:
>> There's no such thing as "barrier". It is a fully artificial abstraction. After all, at the bottom of your stack, you will have to translate it either to a cache flush, or to command order enforcement, or to both.
>
> When people talk about barriers, they are talking about order enforcement.

Not correct. When people talk about barriers, they mean different things. For instance, Alan Cox, a few e-mails ago, meant cache flush. That's the problem with the barrier concept: barriers are ambiguous. There's no barrier which can fit all requirements.

> the hardware capabilities are not directly accessible from userspace (and they probably shouldn't be)

The discussion is not about directly providing storage hardware capabilities to user space. The discussion is about replacing the fully inadequate barrier abstraction with a set of other, adequate abstractions. For instance:

1. Cache flush primitives:

1.1. FUA

1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile media

1.3. Immediate cache flush, i.e. return ASAP after the cache sync has started, possibly before all data hit non-volatile media.

2. An ORDERED attribute for requests. It provides the following behavior rules:

A. All requests without this attribute can be executed in parallel and be freely reordered.

B. No ORDERED command can be completed before any previous command, ORDERED or not, has completed.

Those abstractions can naturally fit all storage capabilities. For instance:

- On simple WT cache hardware not supporting ordering commands, (1) translates to a NOP and (2) to queue draining.

- On full-featured HW, both (1) and (2) translate to the appropriate storage capabilities. On FTL storage (B) can be further optimized by doing data transfers for ORDERED commands in parallel, but committing them in the requested order.

> barriers keep getting mentioned because they are an easy concept to understand.

Well, the concept of a flat Earth with the Sun rotating around it is also easy to understand. So why isn't it used?

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
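One possible encoding of Vlad's abstraction set as per-request flags, for concreteness. These constants are invented for illustration; nothing like them exists in the Linux block layer, and the mail above specifies only the semantics, not any naming or encoding:

/* Hypothetical flags mirroring the two abstraction groups above. */
enum storage_req_flags {
    REQ_F_FUA        = 1 << 0, /* 1.1: write through to media, bypassing
                                       the volatile cache                  */
    REQ_F_FLUSH      = 1 << 1, /* 1.2: complete only when all cached data
                                       has reached non-volatile media      */
    REQ_F_FLUSH_IMM  = 1 << 2, /* 1.3: return once the cache sync has
                                       started, possibly before it is done */
    REQ_F_ORDERED    = 1 << 3, /* 2:   complete only after every
                                       previously submitted request has
                                       completed                           */
};

On hardware with no ordering support, an implementation would have to realize REQ_F_ORDERED by draining the queue, exactly as described for the simple write-through case above; on capable hardware it maps to the device's own ordered-command facility.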
[parent not found: <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com>]
* Re: [sqlite] light weight write barriers [not found] ` <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com> @ 2012-11-17 5:02 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-17 5:02 UTC (permalink / raw) To: 杨苏立 Yang Su Li Cc: General Discussion of SQLite Database, Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

杨苏立 Yang Su Li, on 11/15/2012 11:14 AM wrote:
> 1. fsync actually does two things at the same time: ordering writes (in a barrier-like manner), and forcing cached writes to disk. This makes it very difficult to implement fsync efficiently.

Exactly!

> However, logically they are two distinct functionalities

Exactly! Those two points are exactly why the concept of barriers must be forgotten for the sake of productivity and be replaced by finer grained abstractions, as well as why they were removed from the Linux kernel.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-15 1:17 ` Vladislav Bolkhovitin 2012-11-15 12:07 ` David Lang @ 2012-11-15 17:06 ` Ryan Johnson 2012-11-15 22:35 ` Chris Friesen 1 sibling, 1 reply; 58+ messages in thread From: Ryan Johnson @ 2012-11-15 17:06 UTC (permalink / raw) To: General Discussion of SQLite Database Cc: Vladislav Bolkhovitin, Nico Williams, linux-fsdevel, Theodore Ts'o, linux-kernel, Richard Hipp

On 14/11/2012 8:17 PM, Vladislav Bolkhovitin wrote:
> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are ordered with respect to each other... is practically the same as barriers.
>
> Which barriers? Barriers meaning cache flush or barriers meaning command order, or barriers meaning both?
>
> There's no such thing as "barrier". It is a fully artificial abstraction. After all, at the bottom of your stack, you will have to translate it either to a cache flush, or to command order enforcement, or to both.

Isn't that why we *have* "the stack" in the first place? So apps *don't* have to worry about how the OS implements an artificial (= high-level and portable) abstraction on a given device?

> Are you going to invent 3 types of barriers?

One will do, it just needs to be a good one.

Maybe I'm missing something here, so I'm going to back up a bit and recap what I understand.

The filesystem abstracts the concept of encoding patterns of bits in some physical media (data), and making it easy to find and retrieve those bits later (metadata, incl. file name). When users read(), they expect to see whatever they most recently sent to write(). They also expect that what they write will still be there later, in spite of any failure that leaves the disk itself intact.

Operating systems cheat by not actually writing to disk -- for performance reasons -- and users are (mostly, usually) OK with that, because the performance gains are so attractive and things usually work out anyway. Disks cheat too, in the same way and for the same reason. The cheating works great most of the time, but breaks down -- badly -- if we actually care about what is on disk after a crash (or if we use a network filesystem).

Enough people do care that fsync() was added to the toolbox. It is defined to transfer "all modified in-core data of the file referred to by the file descriptor fd to the disk device" and "blocks until the device reports that the transfer has completed" (quoting from the fsync(2) man page). Translation: "Stop cheating. Make sure the stuff I already wrote actually got written. And tell the disk to stop cheating, too."

Problem is, this definition is asymmetric: it says what happens to writes issued before the fsync, but nothing about those issued after the fsync starts and before it returns [1]. The reader has to assume fsync() makes no promises whatsoever about these later writes: making fsync capture them would expose callers of fsync() to DoS attacks, and preventing them from reaching disk until all outstanding fsync calls complete would add complexity the spec doesn't currently demand, leading to understandable reluctance by kernel devs to code it up. Unfortunately, we're left with the filesystem equivalent of what we in the database world call "eventual consistency" -- easy to implement, nice and fast, but very difficult to write reliable code against unless you're willing to pay the cost of being fully synchronous, all the time.
Having tried that for a few years, many people are "returning" to better-specified concurrency models, trading some amount of performance for comfort that the app will at least work predictably when things go wrong in strange and unanticipated ways.

The request, then, is to tighten up fsync semantics in two conceptually straightforward ways [2]: First, guarantee that later writes to an fd do not hit disk until earlier calls to fsync() complete. Second, make the call asynchronous. That's all.

Note that both changes are necessary. The improved ordering semantic is useless by itself, because it's still not safe to request a blocking fsync from one thread and then let other threads continue issuing writes: there's a race between broadcasting that fsync has begun and issuing the actual syscall that begins it. An asynchronous fsync is also useless by itself, because it only benefits uncoordinated writes (which evidently don't care what data actually reaches disk anyway).

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to the affected file, wait for the device to report success, issue a cache flush to the device (or request ordering commands, if available) to make it tell the truth, and wait for the device to report success. AFAIK this already happens, but without taking advantage of any request ordering commands.
2. The requesting thread returns as soon as the kernel has identified all data that will be written back. This is new, but pretty similar to what AIO already does.
3. No write is allowed to enqueue any requests at the device that involve the same file, until all outstanding fsyncs complete [3]. This is new.

The performance hit for #1 can be reduced significantly if the storage hardware at hand happens to support some form of request ordering. The amount of reduction could vary greatly depending on how sophisticated such request ordering is, and how much effort the kernel and/or device driver are willing to put into it. In any case, fsync should already do this [4].

The performance hit for #3 can be minimized by buffering small or otherwise convenient writes in the fs cache and letting the call return immediately, as usual. The corresponding pages just have to be marked in some way to prevent them from being written back too soon. Sequence numbers work well for this sort of thing. Big requests may have to block, but they probably would have anyway, if the buffer cache couldn't absorb them. As with #1, fancy command ordering capabilities in the underlying device just allow additional performance optimizations.

A carefully-written app (e.g. free of I/O races) would do pretty well with this extended fsync, certainly far better than the current state of the art allows. Note that this still offers no protection for reads: no matter how many times a thread issues fsync(), it still risks reading non-durable data because reads are not ordered wrt either writes or fsync. That's not the problem we're trying to solve, though.

Please feel free to point out where I've gone wrong, but this just doesn't look like as complex or crazy an idea as you make it out to be.

[1] Maybe POSIX.1-2001 is more specific, but it's not publicly available that I could see.

[2] I'm fully aware that implementing the request might require significant -- perhaps even unreasonably complex -- changes to the way the kernel currently does things (though I do doubt it).
That's not a good excuse to claim the idea itself is unreasonably complex or ill-specified. Just say that it's not a good fit for the current code base.

[3] Another concern is whether fsync calls operate on the file or a particular fd. What if a process opens the same file multiple times, or multiple processes have fds pointing to the same file (whether by open or fork)? I would argue for file-level barriers, because it leads to a vastly simpler design (the fs cache doesn't track which process wrote what via what fd). Besides, no app that cares about what ends up on disk will allow uncoordinated writes anyway, so why do extra work just to ensure I/O races stay fast?

[4] Really, device support for request ordering commands is a bit of a red herring: the only way it helps significantly is if (a) the storage device has a massive cache compared to the fs cache, (b) it allows I/O scheduling to reduce latency of reads and/or writes (which fsync should do already, and which matters little for flash), and (c) a logging filesystem is not being used (else it's all sequential writes anyway). In other words, it can help performance a bit but has little other impact on what is essentially a software matter.

>> There's a lot to be said for simplicity... as long as the system is not so simple as to not work at all.
>>
>> My p.o.v. is that a filesystem write barrier is effectively the same as fsync() with the ability to return sooner (before writes hit stable storage) when the filesystem and hardware support on-disk layouts and primitives which can be used to order writes preceding and succeeding the barrier.
>
> Your mistake is that you are considering barriers as something real, which can do something real for you, while it is just an artificial abstraction apparently invented by people with limited knowledge of how storage works, hence having a very foggy vision of how barriers are supposed to be processed by it. A simple wrong answer.

Storage: Accepts writes and ostensibly makes them available via reads even after power failures. Reorders requests nearly arbitrarily and lies about whether writes actually took effect, unless you issue appropriate cache flushing and/or request ordering commands (and sometimes even then, if it was a cheap consumer drive).

OS: Accepts writes and ostensibly makes them available via reads even after power failures, reboots, etc. Reorders requests nearly arbitrarily and lies about whether writes actually took effect, unless you issue a stop-the-world, one-sided write barrier lovingly known as fsync (assuming the disk actually listens when you tell it to stop cheating).

Wish: a two-sided write barrier that not only ensures previously-issued writes complete before it reports success, but also prevents later-issued writes from completing while it is in progress, giving a reasonably simple way to enforce some ordering of writes in the system. Can be implemented entirely in software, as the latter has full control over which requests it chooses to schedule at the device, and also decides whether to block the requesting thread or not. Can be made virtually as fast as current writes, by maintaining a little extra information in the fs cache.

Please, enlighten me: in what way does my limited knowledge of storage, or my foggy vision of what is desired, make this feature impossible to implement or useless if implemented?
> Generally, you can invent any abstraction convenient for you, but the farther your abstractions are from the reality of your hardware, the less you will get from them, and with bigger effort.
>
> There are no barriers in Linux and there are not going to be. Accept it. And start instead thinking about the offload capabilities your storage can offer to you.

Apologies if this comes off as flame-bait, but I start to wonder whose abstraction is broken here... What I understand the above to mean is: "Linux file system abstractions are too far from the reality of storage hardware, so it takes lots of effort to accomplish little [in the way of enforcing write ordering]. Accept it. And start thinking instead about talking directly to a storage controller that offers proper write barriers."

I hope I misread what you said, because that's a depressing thing to hear from your OS.

Ryan

^ permalink raw reply [flat|nested] 58+ messages in thread
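Ryan's proposed "two-sided asynchronous fsync", sketched as code from the application's point of view. The fbarrier() call below is purely hypothetical -- no kernel provides it -- and the stand-in body degrades to a plain fsync(), which delivers the same ordering but with exactly the blocking and forced durability the proposal wants to avoid:

#include <stddef.h>
#include <unistd.h>

static int fbarrier(int fd)
{
    /* Proposed semantics: return as soon as the dirty pages of fd are
     * identified; the kernel then holds back later writes to the same
     * file until those pages (and the device cache) are stable.
     * Stand-in here: today's blocking, durable fsync(). */
    return fsync(fd);
}

static void append_transaction(int fd, const void *log, size_t loglen,
                               const void *commit, size_t commitlen)
{
    (void)write(fd, log, loglen);        /* journal payload            */
    (void)fbarrier(fd);                  /* proposed: returns at once  */
    (void)write(fd, commit, commitlen);  /* proposed: held back by the
                                            kernel until the payload
                                            is stable                  */
}

Under the proposal the second write() may buffer in the fs cache immediately; it is only its arrival at the device that is deferred, which is where the performance win over fsync() comes from.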
* Re: [sqlite] light weight write barriers 2012-11-15 17:06 ` Ryan Johnson @ 2012-11-15 22:35 ` Chris Friesen 2012-11-17 5:02 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 58+ messages in thread From: Chris Friesen @ 2012-11-15 22:35 UTC (permalink / raw) To: Ryan Johnson Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin, Nico Williams, linux-fsdevel, Theodore Ts'o, linux-kernel, Richard Hipp

On 11/15/2012 11:06 AM, Ryan Johnson wrote:
> The easiest way to implement this fsync would involve three things:
> 1. Schedule writes for all dirty pages in the fs cache that belong to the affected file, wait for the device to report success, issue a cache flush to the device (or request ordering commands, if available) to make it tell the truth, and wait for the device to report success. AFAIK this already happens, but without taking advantage of any request ordering commands.
> 2. The requesting thread returns as soon as the kernel has identified all data that will be written back. This is new, but pretty similar to what AIO already does.
> 3. No write is allowed to enqueue any requests at the device that involve the same file, until all outstanding fsyncs complete [3]. This is new.

This sounds interesting as a way to expose some useful semantics to userspace. I assume we'd need to come up with a new syscall or something, since it doesn't match the behaviour of POSIX fsync().

Chris

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-15 22:35 ` Chris Friesen @ 2012-11-17 5:02 ` Vladislav Bolkhovitin 2012-11-20 1:23 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-17 5:02 UTC (permalink / raw) To: Chris Friesen Cc: Ryan Johnson, General Discussion of SQLite Database, Vladislav Bolkhovitin, Nico Williams, linux-fsdevel, Theodore Ts'o, linux-kernel, Richard Hipp

Chris Friesen, on 11/15/2012 05:35 PM wrote:
>> The easiest way to implement this fsync would involve three things:
>> 1. Schedule writes for all dirty pages in the fs cache that belong to the affected file, wait for the device to report success, issue a cache flush to the device (or request ordering commands, if available) to make it tell the truth, and wait for the device to report success. AFAIK this already happens, but without taking advantage of any request ordering commands.
>> 2. The requesting thread returns as soon as the kernel has identified all data that will be written back. This is new, but pretty similar to what AIO already does.
>> 3. No write is allowed to enqueue any requests at the device that involve the same file, until all outstanding fsyncs complete [3]. This is new.
>
> This sounds interesting as a way to expose some useful semantics to userspace.
>
> I assume we'd need to come up with a new syscall or something, since it doesn't match the behaviour of POSIX fsync().

This is how I would export the cache sync and request ordering abstractions to user space:

For async IO (io_submit() and friends) I would extend struct iocb with flags which would allow setting the required capabilities, i.e. whether this request is FUA, or a full cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per each iocb.

For the regular read()/write() I would add one more flag to the "flags" parameter of sync_file_range(): whether this sync is immediate or not.

To enforce ordering rules I would add one more command to fcntl(). It would make the latest submitted write on this fd ORDERED.

All together those should provide the requested functionality in a simple, effective, unambiguous and backward compatible manner.

Vlad

1. See my other today's e-mail about what an immediate cache sync is.

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-17 5:02 ` Vladislav Bolkhovitin @ 2012-11-20 1:23 ` Vladislav Bolkhovitin 2012-11-26 20:05 ` Nico Williams 0 siblings, 1 reply; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-20 1:23 UTC (permalink / raw) To: Chris Friesen Cc: Ryan Johnson, General Discussion of SQLite Database, Nico Williams, linux-fsdevel, Theodore Ts'o, linux-kernel, Richard Hipp

Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:
>>> The easiest way to implement this fsync would involve three things:
>>> 1. Schedule writes for all dirty pages in the fs cache that belong to the affected file, wait for the device to report success, issue a cache flush to the device (or request ordering commands, if available) to make it tell the truth, and wait for the device to report success. AFAIK this already happens, but without taking advantage of any request ordering commands.
>>> 2. The requesting thread returns as soon as the kernel has identified all data that will be written back. This is new, but pretty similar to what AIO already does.
>>> 3. No write is allowed to enqueue any requests at the device that involve the same file, until all outstanding fsyncs complete [3]. This is new.
>>
>> This sounds interesting as a way to expose some useful semantics to userspace.
>>
>> I assume we'd need to come up with a new syscall or something, since it doesn't match the behaviour of POSIX fsync().
>
> This is how I would export the cache sync and request ordering abstractions to user space:
>
> For async IO (io_submit() and friends) I would extend struct iocb with flags which would allow setting the required capabilities, i.e. whether this request is FUA, or a full cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per each iocb.
>
> For the regular read()/write() I would add one more flag to the "flags" parameter of sync_file_range(): whether this sync is immediate or not.
>
> To enforce ordering rules I would add one more command to fcntl(). It would make the latest submitted write on this fd ORDERED.

Correction. To avoid possible races, it would be better for the new fcntl() command to specify the N subsequent read()/write()/sync() calls as ORDERED.

For instance, in the simplest case of N=1, the next write() after the fcntl() would be handled as ORDERED. (Unfortunately, it doesn't look like this old read()/write() interface has space for a more elegant solution)

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
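A sketch of what Vlad's interface might look like from user space, incorporating the correction above. All the constants are invented (neither IOCB_FLAG_ORDERED, IOCB_FLAG_FUA nor F_SET_ORDERED exists in Linux or libaio); only io_prep_pwrite(), io_submit() and the iocb flags field are real libaio pieces:

#include <fcntl.h>
#include <unistd.h>
#include <libaio.h>

/* Invented constants -- illustration only. */
#define IOCB_FLAG_FUA      (1 << 8)
#define IOCB_FLAG_ORDERED  (1 << 9)
#define F_SET_ORDERED      1040

static void submit_ordered_commit(io_context_t ctx, int fd,
                                  void *buf, size_t len, long long off)
{
    struct iocb cb;
    struct iocb *list[1] = { &cb };

    io_prep_pwrite(&cb, fd, buf, len, off);
    /* Per-iocb capability flags, as the proposal suggests. */
    cb.u.c.flags |= IOCB_FLAG_ORDERED | IOCB_FLAG_FUA;
    io_submit(ctx, 1, list);
}

/* For plain read()/write(), per the corrected proposal: mark the next
 * N calls on this fd as ORDERED; here N = 1, for a commit record. */
static void write_ordered(int fd, const void *buf, size_t len)
{
    fcntl(fd, F_SET_ORDERED, 1);   /* hypothetical command    */
    (void)write(fd, buf, len);     /* handled as ORDERED      */
}

The appeal of this shape is that ordering becomes a per-request property the kernel can pass down to hardware that supports ordered commands, rather than a global queue drain.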
* Re: [sqlite] light weight write barriers 2012-11-20 1:23 ` Vladislav Bolkhovitin @ 2012-11-26 20:05 ` Nico Williams 2012-11-29 2:15 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 58+ messages in thread From: Nico Williams @ 2012-11-26 20:05 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Chris Friesen, Ryan Johnson, General Discussion of SQLite Database, linux-fsdevel, Theodore Ts'o, linux-kernel, Richard Hipp

Vlad,

You keep saying that programmers don't understand "barriers". You've provided no evidence of this. Meanwhile memory barriers are generally well understood, and every programmer I know understands that a "barrier" is a synchronization primitive that says that all operations of a certain type will have completed prior to the barrier returning control to its caller.

For some filesystems it is possible to configure fsync() to act as a barrier: for example, ZFS can be told to perform no synchronous operations for a given dataset, in which case fsync() devolves into a simple barrier. (Cue Simon to tell us that some hardware, some OSes, and some filesystems simply cannot implement fsync(), with or without synchronicity.)

So just give us a barrier. Yes, I know, it's tricky to implement, but it'd be OK to return EOPNOTSUPP, and let the app do something else (e.g., call fsync() instead, tell the user to expect instability, tell the user to get a better system, ...).

As for implementation, it helps to have a journalled or log-structured filesystem. It also helps to have hardware synchronization primitives that don't suck, but these aren't entirely necessary: ZFS, for example, can recover [*] from N incomplete transactions [**], and still provides fsync() as a barrier given its on-disk structure and the ZIL. Note that ZFS recovery from incomplete transactions should never be necessary where the HW has proper cache flush support, but the recovery functionality was added precisely because of lousy hardware.

[*] At volume import time, such as at boot-time.
[**] Granted, this requires user input, but if the user didn't care it could be made automatic.

Nico
--

^ permalink raw reply [flat|nested] 58+ messages in thread
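Nico's ZFS example can be made concrete. OpenZFS exposes a per-dataset "sync" property, and setting it to disabled (e.g. "zfs set sync=disabled tank/db", a real property, though the dataset name here is made up) suppresses synchronous semantics; per Nico's description, fsync() then acts only as an ordering point. A sketch of an application relying on that, assuming the loss of durability is acceptable:

#include <stddef.h>
#include <unistd.h>

/* Assumes fd lives on a dataset configured with "zfs set sync=disabled":
 * fsync() no longer blocks on stable storage, but -- per Nico's claim
 * about ZFS's COW transaction groups -- writes before it still reach
 * disk before writes after it. */
static void ordered_not_durable(int fd, const void *a, size_t alen,
                                const void *b, size_t blen)
{
    (void)write(fd, a, alen);
    (void)fsync(fd);           /* barrier only, under sync=disabled */
    (void)write(fd, b, blen);
}

This is exactly the "ACI without D" trade-off Richard Hipp describes later in the thread: a crash can lose the last few seconds of work, but cannot leave b on disk without a.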
* Re: [sqlite] light weight write barriers 2012-11-26 20:05 ` Nico Williams @ 2012-11-29 2:15 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-29 2:15 UTC (permalink / raw) To: Nico Williams Cc: Chris Friesen, Ryan Johnson, General Discussion of SQLite Database, linux-fsdevel, Theodore Ts'o, linux-kernel, Richard Hipp

Nico Williams, on 11/26/2012 03:05 PM wrote:
> Vlad,
>
> You keep saying that programmers don't understand "barriers". You've provided no evidence of this. Meanwhile memory barriers are generally well understood, and every programmer I know understands that a "barrier" is a synchronization primitive that says that all operations of a certain type will have completed prior to the barrier returning control to its caller.

Well, your understanding of memory barriers is wrong, and you are illustrating that the memory barriers concept is not so well understood in practice. Simplifying: memory barrier instructions are not a "cache flush" of this CPU, as is often thought. They set the order in which reads or writes from other CPUs become visible on this CPU, and nothing else. Locally on each CPU, reads and writes are always seen in order.

So, (1) on a single-CPU system memory barrier instructions don't make any sense, and (2) they should go at least in a pair for each CPU participating in the interaction; otherwise it's an apparent sign of a mistake.

There's nothing similar in storage, because storage has strong consistency requirements even if it is distributed. All those clouds and hadoops with weak consistency requirements are outside of this discussion, although even they don't have anything similar to memory barriers.

As I already wrote, the concept of a flat Earth with the Sun revolving around it is also very simple to understand. Are you still using this concept?

> So just give us a barrier.

Similarly to the flat Earth, I'd strongly suggest you start using an adequate concept of what you want to achieve, starting from what I proposed a few e-mails ago in this thread. If you look at it, it offers exactly what you want, only named correctly.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
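Vlad's pairing rule can be illustrated with C11 atomics, the portable user-space analogue of the kernel's smp_wmb()/smp_rmb(): a release on the publishing side is only meaningful when matched by an acquire on the observing side. A minimal sketch (the variables are invented for illustration):

#include <stdatomic.h>

static int data;            /* plain payload    */
static atomic_int flag;     /* publication flag */

/* Thread/CPU A: publish. */
void producer(void)
{
    data = 42;
    /* Release: orders the data store before the flag store -- the
     * "write barrier" half of the pair. */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Thread/CPU B: observe. */
int consumer(void)
{
    /* Acquire: pairs with the release above. Without it, the barrier
     * on the producer side alone guarantees nothing here. */
    if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
        return data;        /* guaranteed to read 42 */
    return -1;              /* not yet published */
}

Dropping either half breaks the guarantee, which is the point being made: a lone barrier instruction on one CPU is "an apparent sign of a mistake".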
* Re: [sqlite] light weight write barriers 2012-11-13 17:40 ` Alan Cox 2012-11-13 19:13 ` Nico Williams @ 2012-11-15 1:16 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-15 1:16 UTC (permalink / raw) To: Alan Cox Cc: Howard Chu, General Discussion of SQLite Database, Theodore Ts'o, drh, linux-kernel, linux-fsdevel

Alan Cox, on 11/13/2012 12:40 PM wrote:
>>> Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid no storage (drives, if you like this term more) at the moment supports barriers and, as far as I know the storage history, none has ever supported them.
>
> The ATA cache flush is a write barrier, and given you have no NV cache visible to the controller it's the same thing.

A cache flush is a cache flush. You can call it a barrier if you want to continue confusing yourself and others.

>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.

None of those details bring anything unsolvable. For instance, I already described in this thread a simple way in which the requested order of commands can be carried through the stack, and implemented that algorithm in SCST.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-11-02 0:38 ` Howard Chu 2012-11-02 12:33 ` Alan Cox @ 2012-11-13 3:37 ` Vladislav Bolkhovitin [not found] ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com> 2 siblings, 0 replies; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-13 3:37 UTC (permalink / raw) To: Howard Chu Cc: General Discussion of SQLite Database, Alan Cox, Vladislav Bolkhovitin, Theodore Ts'o, drh, linux-kernel, linux-fsdevel

Howard Chu, on 11/01/2012 08:38 PM wrote:
> Alan Cox wrote:
>>> How about the fact that the recently added preliminary infrastructure to send ORDERED commands instead of queue draining was deleted from the kernel, because "there's no difference where to drain the queue, on the kernel or the storage side"?
>>
>> Send patches.
>
> Isn't any type of kernel-side ordering an exercise in futility, since
> a) the kernel has no knowledge of the disk's actual geometry
> b) most drives will internally re-order requests anyway
> c) cheap drives won't support barriers

This is why it is so important for performance to use all storage capabilities. In particular, ORDERED commands instead of trying to pretend to be smarter than the storage by doing queue draining.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
[parent not found: <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com>]
* Re: [sqlite] light weight write barriers [not found] ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com> @ 2012-11-13 3:41 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-13 3:41 UTC (permalink / raw) To: Richard Hipp Cc: General Discussion of SQLite Database, Theodore Ts'o, drh, linux-kernel, linux-fsdevel, Alan Cox

Richard Hipp, on 11/02/2012 08:24 AM wrote:
> SQLite cares. SQLite is an in-process, transactional, zero-configuration database that is estimated to be used by over 1 million distinct applications and to have over 2 billion deployments. SQLite uses ordinary disk files in ordinary directories, often selected by the end-user. There is no system administrator with SQLite, so there is no opportunity to use a dedicated filesystem with special mount options.
>
> SQLite uses fsync() as a write barrier to assure consistency following a power loss. In addition, we do everything we can to maximize the amount of time after the fsync() before we actually do another write where order matters, in the hopes that the writes will still be ordered on platforms where fsync() is ignored for whatever reason. Even so, we believe we could get a significant performance boost and reliability improvement if we had a reliable write barrier.

I would suggest you forget the word "barrier" for productivity's sake. You don't want barriers and the confusion they bring. What you want instead is access to storage-accelerated cache sync, command ordering and atomic attributes/operations. See my other today's e-mail about those.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
[parent not found: <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com>]
* Re: [sqlite] light weight write barriers [not found] ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com> @ 2012-11-13 3:42 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 58+ messages in thread From: Vladislav Bolkhovitin @ 2012-11-13 3:42 UTC (permalink / raw) To: 杨苏立 Yang Su Li Cc: Theodore Ts'o, General Discussion of SQLite Database, linux-kernel, linux-fsdevel, Richard Hipp

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:
>>> SATA's Native Command Queuing (NCQ) is not equivalent; this allows the drive to reorder requests (in particular read requests) so they can be serviced more efficiently, but it does *not* allow the OS to specify a partial, relative ordering of requests.
>>
>> And so? If SATA can't do it, does it mean that nobody else can do it either? I know plenty of non-SATA devices which can meet the ordering requirements you need.
>
> I would be very much interested in what kind of devices support this kind of "topological order", and in what settings they are typically used.
>
> Does modern flash/SSD (esp. the kind used in smartphones) support this?
>
> If you could point me to some information about this, that would be very much appreciated.

I don't think the storage in smartphones can support such advanced functionality, because it tends to be the cheapest, hence the simplest. But many modern enterprise SAS drives can do it, because for those customers performance is the key requirement. Unfortunately, I'm not sure I can name exact brands and models, because my knowledge comes from NDA'ed docs, so this info may also be NDA'ed.

Vlad

^ permalink raw reply [flat|nested] 58+ messages in thread
[parent not found: <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com>]
* Re: [sqlite] light weight write barriers [not found] ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com> @ 2012-10-11 16:38 ` Nico Williams 2012-10-11 16:48 ` Nico Williams 0 siblings, 1 reply; 58+ messages in thread From: Nico Williams @ 2012-10-11 16:38 UTC (permalink / raw) To: General Discussion of SQLite Database Cc: Andi Kleen, linux-fsdevel, linux-kernel, drh

On Wed, Oct 10, 2012 at 12:48 PM, Richard Hipp <drh@sqlite.org> wrote:
>> Could you list the requirements of such a light weight barrier? i.e. what would it need to do minimally, what's different from fsync/fdatasync ?
>
> For SQLite, the write barrier needs to involve two separate inodes. The requirement is this: ...
>
> Note also that when fsync() works as advertised, SQLite transactions are ACID. But when fsync() is reduced to a write-barrier, we lose the D (durable) and transactions are only ACI. In our experience, nobody really cares very much about durability across a power-loss. People are mainly interested in Atomic, Consistent, and Isolated. If you take a power loss and then after reboot you find the 10 seconds of work prior to the power loss is missing, nobody much cares about that as long as all of the prior work is still present and consistent.

There is something you can do: use a combination of COW on-disk formats, in such a way that it's possible to detect partially-committed transactions and roll back to the last known good root, and backgrounded fsync()s (i.e., in a separate thread, without waiting for the fsync() to complete).

Nico
--

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [sqlite] light weight write barriers 2012-10-11 16:38 ` Nico Williams @ 2012-10-11 16:48 ` Nico Williams 0 siblings, 0 replies; 58+ messages in thread From: Nico Williams @ 2012-10-11 16:48 UTC (permalink / raw) To: General Discussion of SQLite Database Cc: Andi Kleen, linux-fsdevel, linux-kernel, drh

To expand a bit, the on-disk format needs to allow the roots of the N last transactions to be/remain reachable at all times. At open time you look for the latest transaction, verify that it has been written [0] completely, then use it; else look for the preceding transaction, verify it, and so on. N needs to be at least 2: the last and the preceding transactions. No blocks should be freed or reused for any transactions still in use or possible use (e.g., for power failure recovery). For high read concurrency you can allow connections to lock a past transaction so that no blocks are freed that are needed to access the DB at that state.

This all goes back to 1980s DB and filesystem concepts. See, for example, the 4.4BSD Log-Structured File System. (I mention this in case there are concerns about patents, though IANAL and I make no particular assertions here other than that there is plenty of old prior art and expired patents that can probably be used to obtain sufficient certainty as to the patent law risks in the approach described herein.)

[0] E.g., check a transaction block manifest and check that those blocks were written correctly; or traverse the tree looking for differences from the previous transaction; this may require checking block content checksums.

Nico
--

^ permalink raw reply [flat|nested] 58+ messages in thread
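Nico's open-time recovery loop, sketched in C. Every type and helper here (find_latest_root, verify_tx, previous_root) is invented for illustration; the sketch only captures the walk-back-until-verified structure he describes:

/* Hypothetical helpers -- declarations only, illustration. */
struct dev;
struct tx_root;

extern struct tx_root *find_latest_root(struct dev *d);
extern struct tx_root *previous_root(struct dev *d, struct tx_root *r);
extern int verify_tx(struct dev *d, struct tx_root *r);
                                 /* e.g. block manifest + checksums */

struct tx_root *open_db(struct dev *d)
{
    struct tx_root *r = find_latest_root(d);

    /* Walk back through the N retained transaction roots until one
     * verifies. Because no blocks of a retained root are ever reused,
     * each root is either fully intact or detectably incomplete. */
    while (r != NULL) {
        if (verify_tx(d, r))
            return r;            /* last known-good transaction */
        r = previous_root(d, r);
    }
    return NULL;                 /* no consistent state found */
}

With N >= 2 retained roots, a crash mid-commit costs at most the incomplete transaction itself; everything reachable from the previous verified root remains consistent, which is what makes fsync-as-barrier (ACI without D) workable.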
end of thread, other threads:[~2012-11-29 2:15 UTC | newest] Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <CALwJ=MzHjAOs4J4kGH6HLdwP8E88StDWyAPVumNg9zCWpS9Tdg@mail.gmail.com> 2012-10-10 17:17 ` light weight write barriers Andi Kleen 2012-10-11 16:32 ` [sqlite] " 杨苏立 Yang Su Li 2012-10-11 17:41 ` Christoph Hellwig 2012-10-23 19:53 ` Vladislav Bolkhovitin 2012-10-24 21:17 ` Nico Williams 2012-10-24 22:03 ` david 2012-10-25 0:20 ` Nico Williams 2012-10-25 1:04 ` david 2012-10-25 5:18 ` Nico Williams 2012-10-25 6:02 ` Theodore Ts'o 2012-10-25 6:58 ` david 2012-10-25 14:03 ` Theodore Ts'o 2012-10-25 18:03 ` david 2012-10-25 18:29 ` Theodore Ts'o 2012-11-05 20:03 ` Pavel Machek 2012-11-05 22:04 ` Theodore Ts'o [not found] ` <CALwJ=Mx-uEFLXK2wywekk=0dwrwVFb68wocnH9bjXJmHRsJx3w@mail.gmail.com> 2012-11-05 23:00 ` Theodore Ts'o 2012-10-30 23:49 ` Nico Williams 2012-10-25 5:42 ` Theodore Ts'o 2012-10-25 7:11 ` david 2012-10-27 1:52 ` Vladislav Bolkhovitin 2012-10-25 5:14 ` Theodore Ts'o 2012-10-25 13:03 ` Alan Cox 2012-10-25 13:50 ` Theodore Ts'o 2012-10-27 1:55 ` Vladislav Bolkhovitin 2012-10-27 1:54 ` Vladislav Bolkhovitin 2012-10-27 4:44 ` Theodore Ts'o 2012-10-30 22:22 ` Vladislav Bolkhovitin 2012-10-31 9:54 ` Alan Cox 2012-11-01 20:18 ` Vladislav Bolkhovitin 2012-11-01 21:24 ` Alan Cox 2012-11-02 0:15 ` Vladislav Bolkhovitin 2012-11-02 0:38 ` Howard Chu 2012-11-02 12:33 ` Alan Cox 2012-11-13 3:41 ` Vladislav Bolkhovitin 2012-11-13 17:40 ` Alan Cox 2012-11-13 19:13 ` Nico Williams 2012-11-15 1:17 ` Vladislav Bolkhovitin 2012-11-15 12:07 ` David Lang 2012-11-16 15:06 ` Howard Chu 2012-11-16 15:31 ` Ric Wheeler 2012-11-16 15:54 ` Howard Chu 2012-11-16 18:03 ` Ric Wheeler 2012-11-16 19:14 ` David Lang 2012-11-17 5:02 ` Vladislav Bolkhovitin [not found] ` <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com> 2012-11-17 5:02 ` Vladislav Bolkhovitin 2012-11-15 17:06 ` Ryan Johnson 2012-11-15 22:35 ` Chris Friesen 2012-11-17 5:02 ` Vladislav Bolkhovitin 2012-11-20 1:23 ` Vladislav Bolkhovitin 2012-11-26 20:05 ` Nico Williams 2012-11-29 2:15 ` Vladislav Bolkhovitin 2012-11-15 1:16 ` Vladislav Bolkhovitin 2012-11-13 3:37 ` Vladislav Bolkhovitin [not found] ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com> 2012-11-13 3:41 ` Vladislav Bolkhovitin [not found] ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com> 2012-11-13 3:42 ` Vladislav Bolkhovitin [not found] ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com> 2012-10-11 16:38 ` Nico Williams 2012-10-11 16:48 ` Nico Williams