On Jun 17, 2022, at 5:56 PM, Santosh S wrote:
>
> On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o wrote:
>>
>> On Fri, Jun 17, 2022 at 12:38:20PM -0400, Santosh S wrote:
>>> Dear ext4 developers,
>>>
>>> This is my test - preallocate a large file (2G) and then do
>>> sequential 4K direct-io writes to that file, with fdatasync after
>>> every write. I am preallocating using fallocate mode 0. I noticed
>>> that if the 2G file is pre-written rather than fallocate'd I get
>>> more than twice the throughput. I could reproduce this with fio.
>>> The storage is nvme. Kernel version is 5.3.18 on Suse.
>>>
>>> Am I doing something wrong or is this difference expected? Any
>>> suggestion to get better throughput without actually pre-writing
>>> the file?
>>
>> This is, alas, expected. The reason is that when you use fallocate,
>> the extent is marked as uninitialized, so that when you read from
>> those newly allocated blocks, you don't see previously written data
>> belonging to deleted files. Those files could contain someone
>> else's e-mail, or medical information, etc. So if we didn't do
>> this, it would be a walking, talking HIPAA or PCI violation.
>>
>> So when you write to an fallocated region and then call
>> fdatasync(2), we need to update the metadata blocks to clear the
>> uninitialized bit, so that when you read from the file after a
>> crash, you actually get the data that was written. That makes
>> fdatasync(2) quite a heavyweight operation here, since the required
>> metadata update forces a journal commit. When you do an overwrite,
>> there is no need to force a metadata update and journal commit,
>> which is why write(2) plus fdatasync(2) is much lighter weight for
>> an overwrite.
>>
>> What enterprise databases (e.g., Oracle Enterprise Database and
>> IBM's Informix DB) tend to do is fallocate a chunk of space (say,
>> 16MB or 32MB), because on legacy Unix OS's this tends to make some
>> file systems' block allocators more likely to allocate a contiguous
>> block range, and then immediately write zeros over that 16MB or
>> 32MB, followed by an fdatasync(2). That fdatasync(2) updates the
>> extent tree once to mark the whole 16MB or 32MB of the database's
>> tablespace file as initialized, so you only pay for the metadata
>> update once, instead of every few dozen kilobytes as you write each
>> database commit into the tablespace file.
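>> A minimal sketch of that pattern (the 32MB chunk size and the
>> function name are illustrative, and error handling is trimmed)
>> might look like this:
>>
>> #define _GNU_SOURCE             /* for fallocate(2) */
>> #include <fcntl.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> #define CHUNK (32 << 20)        /* extend the file 32MB at a time */
>>
>> /* Extend the file by one chunk at 'off'.  The zero-fill plus a
>>  * single fdatasync(2) marks the whole chunk initialized in one
>>  * journal commit, instead of one commit per small write later on. */
>> static int extend_chunk(int fd, off_t off)
>> {
>>         void *zeros;
>>         int rc;
>>
>>         if (fallocate(fd, 0, off, CHUNK) < 0)
>>                 return -1;
>>
>>         /* O_DIRECT writes require an aligned buffer. */
>>         if (posix_memalign(&zeros, 4096, CHUNK) != 0)
>>                 return -1;
>>         memset(zeros, 0, CHUNK);
>>         rc = pwrite(fd, zeros, CHUNK, off) == CHUNK ? fdatasync(fd) : -1;
>>         free(zeros);
>>         return rc;
>> }
>>
>> Once extend_chunk() has succeeded, 4K direct-io writes into that
>> range are plain overwrites, so each write(2) plus fdatasync(2) pair
>> no longer forces an extent tree update and journal commit.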
>> There is also an old, out-of-tree patch which enables an fallocate
>> mode called "no hide stale", which marks the extent tree blocks
>> that are allocated using fallocate(2) as initialized. This
>> substantially speeds things up, but it is potentially a walking,
>> talking HIPAA or PCI violation, in that revealing previously
>> written data is considered a horrible security violation by most
>> file system developers.
>>
>> If you know, say, that a cluster file system is the only user of
>> the file system, and all data is written encrypted at rest using a
>> per-user key, such that exposing stale data is not a security
>> disaster, the "no hide stale" flag could be "safe" in that highly
>> specialized use case.
>>
>> But that assumes that file system authors can trust application
>> writers not to do something stupid and insecure, and historically,
>> file system authors (possibly with good reason, given bitter past
>> experience) don't trust application writers not to do something
>> which is very easy and gooses performance, even if it has terrible
>> side effects on either data robustness or data security.
>>
>> Effectively, the no hide stale flag could be considered an
>> "attractive nuisance"[1], and so support for this feature has never
>> been accepted into the mainline kernel, nor into any distro
>> kernels, since the distribution companies don't want to be held
>> liable for creating an "attractive nuisance" that might enable
>> application authors to shoot themselves in the foot.
>>
>> [1] https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine
>>
>> In any case, the technique of fallocate(2) plus zero-fill-write
>> plus fdatasync(2) isn't *that* slow, and it is only needed when you
>> are first extending the tablespace file. In the steady state, most
>> database applications tend to be overwriting space, so this isn't
>> an issue.
>>
>> And if you need to get that last 5% or so of performance --- say,
>> if you are an enterprise database company interested in taking out
>> a full-page advertisement on the back cover of Business Week
>> Magazine touting how your enterprise database benchmarks beat the
>> competition's --- the simple solution is to use a raw block device.
>> Of course, most end users want the convenience of the file system,
>> but that's not the point if you are engaging in benchmarketing. :-)
>>
>> Cheers,
>>
>> - Ted
>
> Thank you for a comprehensive answer :-)
>
> I have one more question - when I gradually increase the I/O
> transfer size, the performance degradation begins to lessen, and at
> 32K it is similar to the "overwriting the file" case. I assume this
> is because the metadata update is now spread over 32K of data rather
> than 4K.

When splitting unwritten extents, the ext4 code will write out zero
blocks up to 32KB by default (/sys/fs/ext4/*/extent_max_zeroout_kb) to
avoid having millions of very small extents in a file (e.g. in the
case of a pathological alternating 4KB write pattern). If your test
is writing blocks >= 32KB, then this no longer needs to be done. If
it is writing smaller blocks, then it makes sense that the speed is
half the raw speed, because the file blocks are all being written
twice (first with zeroes, then with actual data by a later write).

32KB (or 64KB) is a reasonable minimum size, because any disk write
will take about the same time for a single block as for a whole
sector, so doing writes in smaller units is not very efficient.
Depending on the underlying storage (e.g. RAID-6) it might be more
efficient to set extent_max_zeroout_kb=1024 or similar.

> However, my understanding is that, in my case, an extent should
> represent at most 128MiB of data, and so the clearing of the
> uninitialized bit for an extent should happen once every 128MiB, so
> why is a higher transfer size making a difference?

You are misunderstanding how uninitialized extents are cleared. The
uninitialized extent is split into two or three parts, where only the
extent that has data written to it (at minimum 32KB, as above) is set
to "initialized", and the remaining one or two extents are left
uninitialized. Otherwise, each write to an uninitialized extent could
need up to 128MB of zeroes written to disk each time, which would be
slow and add latency.
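You can see this split directly with the FIEMAP ioctl (this is
essentially what "filefrag -v" prints). A rough sketch, with a
made-up file name and no error handling: fallocate a 128MB region,
write 4KB into the middle of it, and dump the extents. The written
range (rounded up to extent_max_zeroout_kb) shows up as initialized,
while the rest of the region stays unwritten:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

int main(void)
{
        int fd = open("/mnt/scratch/testfile",
                      O_RDWR | O_CREAT | O_TRUNC, 0600);
        char buf[4096];
        struct fiemap *fm;
        unsigned int i;

        fallocate(fd, 0, 0, 128 << 20);         /* one big unwritten extent */
        memset(buf, 'x', sizeof(buf));
        pwrite(fd, buf, sizeof(buf), 64 << 20); /* 4KB write in the middle */
        fdatasync(fd);                          /* forces the extent split */

        fm = calloc(1, sizeof(*fm) + 16 * sizeof(struct fiemap_extent));
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_extent_count = 16;
        ioctl(fd, FS_IOC_FIEMAP, fm);

        for (i = 0; i < fm->fm_mapped_extents; i++)
                printf("logical %9llu KB len %9llu KB %s\n",
                       (unsigned long long)fm->fm_extents[i].fe_logical / 1024,
                       (unsigned long long)fm->fm_extents[i].fe_length / 1024,
                       (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
                       "unwritten" : "initialized");
        return 0;
}

Cheers, Andreas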