On Tue, Jan 07, 2020 at 11:44:43AM -0700, Chris Murphy wrote:
> On Fri, Jan 3, 2020 at 10:38 PM Zygo Blaxell wrote:
> >
> > On Thu, Jan 02, 2020 at 04:22:37PM -0700, Chris Murphy wrote:
> > > I've seen with 16KiB leaf size, often small files that could be inlined are instead put into a data block group, taking up a minimum 4KiB block size (on x86_64 anyway). I'm not sure why, but I suspect there just isn't enough room in that leaf to always use inline extents, and yet there is enough room to just reference it as a data block group extent. When using a larger node size, a larger percentage of small files ended up using inline extents. I'd expect this to be quite a bit more efficient, because it eliminates a time-expensive (on HDD anyway) seek.
> >
> > Putting a lot of inline file data into metadata pages makes them less dense, which is either good or bad depending on which bottleneck you're currently hitting. If you have snapshots there is an up-to-300x metadata write amplification penalty to update extent item references every time a shared metadata page is unshared. Inline extents reduce the write amplification. On the other hand, if you are doing a lot of 'find'-style tree sweeps, then inline extents will reduce their efficiency because more pages will have to be read to scan the same number of dirents and inodes.
>
> Egads! Soo... total tangent. I'll change the subject.
>
> I have had multiple flash drive failures while using Btrfs: all Samsung, several SD Cards, and so far two USB sticks. They all fail in essentially the same way: the media itself becomes read-only. USB: writes succeed but they do not persist. Write data to the media, and there is no error. Read that same sector back, and the old data is there. SD Card: writes fail with a call trace and diagnostic info unique to the SD card kernel code, and everything just goes belly up. This happens inside of 6 months of rather casual use as rootfs. And BTW Samsung always replaces the media under warranty without complaint.
>
> It's not a scientific sample. Could be the host device, which is the same in each case. Could be a bug in the firmware. I have nothing to go on really.

It seems to be normal behavior for USB sticks and SD cards. I've also had USB sticks degrade (bit errors) simply from sitting unused on a shelf for six months. Some low-end SATA SSDs (like $20/TB drives from Amazon) are giant USB sticks with a SATA interface, and will fail the same way.

SD card vendors are starting to notice, and there are now SD card options with higher endurance ratings. Still, it's "putting this card in a dashcam voids the warranty" in most cases.

ext2 and msdos both make USB sticks last longer, but they have obvious other problems. From my fleet of Raspberry Pis, I find that SD cards last longer on btrfs than ext4 with comparable write loads, but both still lead very short lives, and the biggest life expectancy improvement (up to a couple of years) comes from eliminating local writes entirely.

> But I wonder if this is due to write amplification that's just not anticipated by the manufacturers? Is there any way to test for this or estimate the amount of amplification? This class of media doesn't report LBAs written, so I'm at quite a loss for useful information to know what the cause is.
> The relevance here though is, I really like the idea of Btrfs used as a rootfs for things like IoT because of compression, ostensibly there are ssd optimizations, and always-on checksumming to catch what often can be questionable media: USB sticks, SD Cards, eMMC, etc. But not if the write amplification has a good chance of killing people's hardware (I have no proof of this, but now I wonder, as I read your email).
>
> I'm aware of write amplification, I just didn't realize it could be this massive. Is it 300x just by having snapshots at all? Or does it get worse with each additional snapshot? And is it multiplicative or exponentially worse?

A 16K subvol metadata page can hold ~300 extent references. Each extent reference is bidirectional--there is a reference from the subvol metadata page to the extent data item, and a backref from the extent data item to the reference. If a new reference is created via clone or dedupe, or by partially overwriting the middle of an extent, then the extent item's reference count is increased and new backrefs are added to the extent item's page.

When a snapshot is created, all the metadata pages except the root become shared. The referenced extent data items are not changed at this time, as there is still only one metadata page containing references to each extent data item. The metadata page carrying the extent reference items now has multiple owners, which are ancestor nodes in all of the snapshot subvol trees.

The backref walking code starts from an extent data item and follows references back to subvol metadata pages. If the subvol metadata pages are also shared, then the walking code follows those back to the subvol roots. The true reference count for an extent is a combination of the direct references (subvol leaf page to extent data) and the indirect references (subvol root or node page to subvol leaf page).

When a snapshot metadata page is modified, a new metadata page is created with mostly the same content, give or take the items that are added or modified. This inserts ~300 new extent data backref items into the extent tree, because those extents are now owned by both the old and new versions of the metadata page. It is as if the files located in the subvol metadata page were silently reflinked from the old subvol to the new one, but only in the specific areas listed on the single modified metadata page.

In the worst case, all ~300 extent data items are stored on separate extent tree pages (i.e. you have really bad fragmentation on a big filesystem, and all of the extents in a file are in different places on disk). In that case, to modify a single page of shared subvol metadata, we must also update up to 300 extent tree pages. This is where the 300x write multiplier comes from. It's not really the worst case--each of those 300 page updates has its own multiplication (e.g. an extent tree page may be overfilled and split, doubling one write), and if you end up freeing pages all over the filesystem, there will be free space cache/tree modifications as well.

In the best case, the metadata page isn't shared (i.e. it was already CoWed, or it's a brand new metadata page). In this case there's no need to update backref pointers or reference counts for unmodified items, as they will be deleted during the same transaction that creates a copy of them.

Real cases fall between 1x and 300x. The first time you modify a metadata page in a snapshotted subvol, you must also update ~300 extent data item backrefs (unless the subvol is so small that it contains fewer than 300 items).
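If it helps to make the arithmetic concrete, here's a rough back-of-the-envelope sketch of the worst case (plain Python; the 16K page size and ~300 references per page are the figures above, everything else is just illustrative, not exact on-disk accounting):

    # Worst-case write multiplication when unsharing one snapshotted
    # subvol metadata leaf.  Illustrative numbers only.

    LEAF_SIZE = 16 * 1024        # 16K metadata page (nodesize)
    REFS_PER_LEAF = 300          # ~300 extent references per leaf

    # Unsharing the leaf rewrites the leaf itself...
    subvol_leaf_writes = 1

    # ...and, in the worst case, each of its ~300 extent data items sits
    # on a different extent tree page, so every backref update dirties a
    # separate 16K page.
    extent_tree_writes = REFS_PER_LEAF

    logical_change = LEAF_SIZE   # one metadata page modified
    physical_write = (subvol_leaf_writes + extent_tree_writes) * LEAF_SIZE

    print(physical_write // 1024, "KiB written")                    # 4816 KiB
    print("amplification: ~%dx" % (physical_write // logical_change))  # ~301x

That's roughly 301 16K pages written to change one, before counting page splits or free space tree updates.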
That ~300-item update repeats for every shared page as it is unshared. Over time, shared pages are replaced with unshared pages, and performance within the subvol returns to normal levels. If a new snapshot is created for the subvol, we start over with ~300 item updates per page, since every metadata page in the subvol is shared again.

The write multiplication contribution from snapshots quickly drops back to 1x, but it is worst just after a new snapshot is created (for either the old or the new subvol). On a big filesystem it can take minutes to create a new directory, rename, delete, or hardlink files for the first time after a snapshot is created, as even the most trivial metadata update becomes an update of tens of thousands of randomly scattered extent tree pages.

Snapshots and 'cp -ax --reflink=always a b' are comparably expensive if you modify more than about 0.3% of the metadata and the modifications are evenly distributed across the subvol. In the end, you'll have completely unshared metadata and will have touched every page of the extent tree either way, but the timing will be different (e.g. mkdirs and renames will be blamed for the low performance, when the root cause is really the snapshot for backups that happened 12 hours earlier).

> In the most prolific snapshotting case, I had two subvolumes, each with 20 snapshots (at most). I used the default ssd mount option for the SD cards, most recently ssd_spread with the USB sticks. And now nossd with the most recent USB stick I just started to use.

The number of snapshots doesn't really matter: you get the up-to-300x write multiplier from writing to a subvol that has shared metadata pages, which happens as soon as you have even one snapshot. It doesn't matter if you have 1 snapshot or 10000 (at least, not for _this_ reason).
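For what it's worth, here is the same kind of back-of-the-envelope arithmetic behind the ~0.3% figure above (Python again; the subvol size and the assumption that the extent tree has roughly as many pages as the subvol's metadata are made up for illustration, not measured from a real filesystem):

    # Why ~0.3%: each unshared subvol leaf updates up to ~300 scattered
    # extent tree pages, so once ~1 in 300 of the subvol leaves have been
    # unshared (modifications spread evenly), roughly every extent tree
    # page has been rewritten once -- about the same total work as a full
    # reflink copy does up front.  Numbers below are hypothetical.

    REFS_PER_LEAF = 300
    N_SUBVOL_LEAVES = 100_000        # hypothetical subvol metadata leaves
    N_EXTENT_TREE_PAGES = 100_000    # assume extent tree ~ same size

    reflink_copy_cost = N_EXTENT_TREE_PAGES          # every page, up front

    break_even = 1 / REFS_PER_LEAF                   # fraction of metadata modified
    leaves_unshared = int(N_SUBVOL_LEAVES * break_even)
    snapshot_cost = leaves_unshared * REFS_PER_LEAF  # spread out over time

    print("break-even fraction ~%.2f%%" % (break_even * 100))            # ~0.33%
    print("snapshot:", snapshot_cost, "extent tree page updates")        # 99900
    print("reflink: ", reflink_copy_cost, "extent tree page updates")    # 100000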