From: Hannes Reinecke
Subject: Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
Date: Mon, 29 Feb 2016 11:06:32 +0800
Message-ID: <56D3B5B8.5030601@suse.de>
References: <05353ADC-601C-412D-80E2-1F1972324A37@hgst.com> <56CBD878.4070600@sandisk.com> <20F4EEF6-8F58-42A0-99E9-E932B51D0387@hgst.com> <20160223084042.GA24086@quack.suse.cz> <4E5FED9C-0FE5-4C31-B3C2-7C6225AB8C62@hgst.com> <20160224084746.GD10096@quack.suse.cz> <1A7F5162-E47D-4E95-8004-8100C6CA231D@hgst.com>
In-Reply-To: <1A7F5162-E47D-4E95-8004-8100C6CA231D@hgst.com>
List-Id: linux-scsi@vger.kernel.org
To: Damien Le Moal, Jan Kara
Cc: Bart Van Assche, lsf-pc@lists.linuxfoundation.org, linux-block@vger.kernel.org, Matias Bjorling, linux-scsi@vger.kernel.org

On 02/29/2016 10:02 AM, Damien Le Moal wrote:
>
>> On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
>>>
>>>> On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
>>>>>
>>>>>> On 02/22/16 18:56, Damien Le Moal wrote:
>>>>>>> 2) Write back of dirty pages to SMR block devices:
>>>>>>>
>>>>>>> Dirty pages of a block device inode are currently processed using the generic_writepages function, which can be executed simultaneously by multiple contexts (e.g. sync, fsync, msync, sync_file_range, etc.). Since mutual exclusion of the dirty page processing is achieved only at the page level (page lock & page writeback flag), multiple processes executing a "sync" of overlapping block ranges over the same zone of an SMR disk can cause an out-of-LBA-order sequence of write requests to be sent to the underlying device. On a host-managed SMR disk, where sequential writing to disk zones is mandatory, this results in errors and makes it impossible for an application using raw sequential disk write accesses to be guaranteed successful completion of its write or fsync requests.
>>>>>>>
>>>>>>> Using the zone information attached to the SMR block device queue (introduced by Hannes), calls to the generic_writepages function can be made mutually exclusive on a per-zone basis by locking the zones. This guarantees sequential request generation for each zone and avoids write errors without any modification to the generic code implementing generic_writepages.
>>>>>>>
>>>>>>> This is but one possible solution for supporting SMR host-managed devices without any major rewrite of page cache management and write-back processing. The opinion of the audience regarding this solution, and discussion of other potential solutions, would be greatly appreciated.
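To illustrate what Damien is proposing here, per-zone serialization of generic_writepages could look roughly like the sketch below. The zone lookup and locking helpers (blk_lookup_zone(), blk_zone_end(), blk_zone_lock()/blk_zone_unlock()) are invented placeholders for whatever the zone-information patches end up exposing, not existing kernel functions; only generic_writepages() and struct writeback_control are taken from the current tree.

#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>

/* Illustrative sketch only: serialize block device writeback per zone. */
static int blkdev_writepages_zoned(struct address_space *mapping,
				   struct writeback_control *wbc)
{
	struct block_device *bdev = I_BDEV(mapping->host);
	loff_t pos = wbc->range_start;
	int ret = 0;

	for (;;) {
		/* hypothetical: zone containing byte offset 'pos' */
		struct blk_zone *zone = blk_lookup_zone(bdev, pos);
		struct writeback_control zwbc = *wbc;

		/* clamp this writeback pass to a single zone */
		zwbc.range_start = pos;
		if (blk_zone_end(zone) < zwbc.range_end)	/* hypothetical */
			zwbc.range_end = blk_zone_end(zone);

		/*
		 * One writer per zone: with the zone held, the dirty pages
		 * inside it are issued in ascending index order and no other
		 * sync/fsync context can interleave its writes.
		 */
		blk_zone_lock(zone);				/* hypothetical */
		ret = generic_writepages(mapping, &zwbc);
		blk_zone_unlock(zone);				/* hypothetical */

		if (ret || zwbc.range_end >= wbc->range_end)
			break;
		pos = zwbc.range_end + 1;
	}
	return ret;
}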
>>>>>> Hello Damien,
>>>>>>
>>>>>> Is it sufficient to support filesystems like BTRFS on top of SMR drives, or would you also like to see that filesystems like ext4 can use SMR drives? In the latter case: the behavior of SMR drives differs so significantly from that of other block devices that I'm not sure we should try to support these directly from infrastructure like the page cache. If we look e.g. at NAND SSDs then we see that the characteristics of NAND do not match what filesystems expect (e.g. large erase blocks). That is why every SSD vendor provides an FTL (Flash Translation Layer), either inside the SSD or as a separate software driver. An FTL implements a so-called LFS (log-structured filesystem). With what I know about SMR, this technology also looks suitable for implementing an LFS. Has it already been considered to implement an LFS driver for SMR drives? That would make it possible for any filesystem to access an SMR drive as any other block device. I'm not sure of this, but maybe it will be possible to share some infrastructure with the LightNVM driver (directory drivers/lightnvm in the Linux kernel tree), since that driver also implements an FTL.
>>>>>
>>>>> I totally agree with you that trying to support SMR disks by only modifying the page cache, so that unmodified standard file systems like BTRFS or ext4 remain operational, is not realistic at best, and more likely simply impossible. For this kind of use case, as you said, an FTL or a device mapper driver is much more suitable.
>>>>>
>>>>> The case I am considering for this discussion is raw block device accesses by an application (writes from user space to /dev/sdxx). This is a very likely use case scenario for high-capacity SMR disks with applications like distributed object stores / key-value stores.
>>>>>
>>>>> In this case, write-back of dirty pages in the block device file inode mapping is handled in fs/block_dev.c using the generic helper function generic_writepages. This does not guarantee the generation of the required sequential write pattern per zone necessary for host-managed disks. As I explained, aligning calls of this function to zone boundaries while locking the zones under write-back simply solves the problem (implemented and tested). This is of course only one possible solution. Pushing modifications deeper into the code, or providing a "generic_sequential_writepages" helper function, are other potential solutions that in my opinion are worth discussing, as other types of devices may also benefit in terms of performance (e.g. regular disk drives prefer sequential writes, and SSDs as well) and/or see a lighter load on an underlying FTL or device mapper driver.
>>>>>
>>>>> For a file system, an SMR-compliant implementation of a file inode mapping writepages method should be provided by the file system itself, as the sequentiality of the write pattern further depends on the block allocation mechanism of the file system.
>>>>>
>>>>> Note that the goal here is not to hide from applications the sequential write constraint of SMR disks. The page cache itself (the mapping of the block device inode) remains unchanged. But the modification proposed guarantees that a well-behaved application writing sequentially to zones through the page cache will see successful sync operations.
>>>>
>>>> So the easiest solution for the OS, when the application is already aware of the storage constraints, would be for the application to use direct IO. Because when using the page cache and writeback, there are all sorts of unexpected things that can happen (e.g. writeback decides to skip a page because someone else locked it temporarily). So it will work in 99.9% of cases, but sometimes things will be out of order for hard-to-track-down reasons. For ordinary drives this is not an issue because we just slow down writeback a bit, and the rareness of the event makes it a non-issue. But for host-managed SMR the IO fails, and that is something the application does not expect.
>>>>
>>>> So I would really say just avoid using the page cache when you are using SMR drives directly without a translation layer. For writes your throughput won't suffer anyway, since you have to do big sequential writes. Using the page cache for reads may still be beneficial, and if you are careful enough not to do direct IO writes to the same range as you do buffered reads, this will work fine.
>>>>
>>>> Thinking some more - if you want to make it foolproof, you could implement something like a read-only page cache for block devices. Any write will in fact be a direct IO write, writeable mmaps will be disallowed, and reads will honor the O_DIRECT flag.
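To make the direct IO suggestion concrete, a minimal user-space sketch of the access pattern Jan describes: sequential, block-aligned writes straight to the raw device, bypassing the page cache. The device path, the 1 MiB write size and the 4 KiB alignment are arbitrary placeholders and would have to match the actual device and the zone's write pointer.

#define _GNU_SOURCE		/* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SIZE	(1024 * 1024)	/* arbitrary; multiple of the logical block size */

int main(void)
{
	void *buf;
	off_t pos = 0;		/* must match the target zone's write pointer */
	int i, err;
	/* /dev/sdX is a placeholder for the SMR disk */
	int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT needs an aligned buffer (and aligned size/offset) */
	err = posix_memalign(&buf, 4096, WRITE_SIZE);
	if (err) {
		fprintf(stderr, "posix_memalign: %s\n", strerror(err));
		return 1;
	}
	memset(buf, 0xab, WRITE_SIZE);

	/* strictly sequential: each write starts where the previous one ended */
	for (i = 0; i < 16; i++) {
		ssize_t ret = pwrite(fd, buf, WRITE_SIZE, pos);
		if (ret != WRITE_SIZE) {
			perror("pwrite");
			break;
		}
		pos += ret;
	}

	free(buf);
	close(fd);
	return 0;
}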
>>> Hi Jan,
>>>
>>> Indeed, using O_DIRECT for raw block device writes is an obvious solution to guarantee the application successful sequential writes within a zone. However, host-managed SMR disks (and to a lesser extent host-aware drives too) already put on applications the constraint of ensuring sequential writes. Adding to this a further mandatory rewrite to support direct I/Os is, in my opinion, asking a lot, if not too much.
>>
>> So I don't think adding O_DIRECT to the open flags is such a burden - sequential writes are IMO much harder to do :). And furthermore this could happen magically inside the kernel, in which case the app needn't be aware of it at all (similarly to how we handle writes to persistent memory).
>>
>>> The example you mention above of writeback skipping a locked page and resulting in I/O errors is precisely what the proposed patch avoids by first locking the zone the page belongs to. In the same spirit as the writeback page locking, if the zone is already locked, it is skipped. That is, zones are treated in a sense as gigantic pages, ensuring that the actual dirty pages within each one are processed in one go, sequentially.
>>
>> But you cannot rule out the mm subsystem locking a page to do something (e.g. migrating the page to help with compaction of large order pages). These other places accessing and locking pages are what I'm worried about. Furthermore, kswapd can decide to write back a particular page under memory pressure, and that will just make the SMR disk freak out.
>>
>>> This allows preserving all possible application level accesses (buffered, direct or mmapped). The only constraint is the one the disk imposes: writes must be sequential.
>>>
>>> Granted, this view may be too simplistic and may be overlooking some hard-to-track page locking paths which will compete with this. But I think that this can be easily solved by forcing the zone-aligned generic_writepages calls to not skip any page (a flag in struct writeback_control would do the trick). And no modification is necessary on the read side (i.e. page locking alone is enough), since reading an SMR disk's blocks beyond a zone's write pointer position does not make sense (in Hannes' code this is possible, but the request does not go to the disk and returns garbage data).
>>>
>>> Bottom line: no fundamental change to the page caching mechanism is needed; only changing how it is used/controlled for writeback makes this work. Considering the benefits on the application side, it is in my opinion a valid modification to have.
>>
>> See above, there are quite a few places which will break your assumptions, and I don't think changing them all to handle SMR is worth it. IMO caching sequential writes to SMR disks has little benefit (if any) anyway, so I would just avoid it. We can talk about how to make this as seamless to applications as possible. The only thing which I don't think is reasonably doable without dirtying the page cache is writeable mmaps of an SMR device, so applications would have to avoid those.
>
> Jan,
>
> Thank you for your insight. These "few places" breaking sequential write sequences are indeed problematic for SMR drives. At the same time, I wonder how these paths would react to an I/O error generated by the "write at write pointer" check in the request submission path at the SCSI level. Could these be ignored in the case of an "unaligned write" error? That is, the page is left dirty and hopefully the regular writeback path catches it later in the proper sequence. This may however be dangerous, as there is no way to determine whether the unaligned error is due to kswapd or other kernel threads trying to write back the "wrong" page, or to the application having submitted an out-of-sequence write.
>
> Until now, the discussion has focused on avoiding unaligned write errors for cached writes. But these errors happen only on host-managed SMR disks. Another aspect of SMR support should also be avoiding random writes to zones on host-aware disks. These do not return an error on unaligned writes and silently process them as a regular disk would. However, this can over time degrade performance, as the disk FW has to handle more and more internal zone defragmentation.

To chime in here, we _might_ be able to fix this via a totally different route.
If we were allowed to pass _linked_ bios to ->make_request_fn (i.e. bios where the ->bi_next field is already populated) we would have an easy marker for merging those requests. At the same time we would be able to process these linked bios as a single unit, allowing other bios to be added only at the front or the back of the chain. That would guarantee in-order delivery for SMR, and at the same time allow us to get merging running for block-mq.
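Very roughly, the idea would look something like the sketch below. Nothing in it exists today; the chain-submission helper and the calling convention into ->make_request_fn are purely illustrative, and only the bio fields, bio_sectors() and bdev_get_queue() are real.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: the caller hands down a pre-linked, LBA-ordered chain of
 * bios (via ->bi_next) and the block layer treats the chain as one unit.
 * It may merge within the chain or add bios at its head or tail, but it
 * must never reorder the chain members.
 */
static void submit_bio_chain(struct bio *head)	/* hypothetical helper */
{
	struct request_queue *q = bdev_get_queue(head->bi_bdev);
	struct bio *bio;

	/* the chain is expected to already be in ascending LBA order */
	for (bio = head; bio->bi_next; bio = bio->bi_next)
		WARN_ON(bio->bi_next->bi_iter.bi_sector <
			bio->bi_iter.bi_sector + bio_sectors(bio));

	/*
	 * Today bios are fed to the queue one at a time; the proposal is to
	 * pass the whole chain in a single call so that in-order delivery is
	 * preserved all the way down (hypothetical calling convention, the
	 * current ->make_request_fn does not accept a linked chain).
	 */
	q->make_request_fn(q, head);
}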
Alternatively one could try to use plugging here, but I'm not sure if that would be sufficient; I will need to test.

> If possible, I look forward to more discussions about this at LSF/MM.
>
Same here.

Btw, I do like the idea of Online logical head depop. No idea how we could implement that, but the idea is nice.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   zSeries & Storage
hare@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)