From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Nicholas A. Bellinger"
Subject: Re: [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion
Date: Thu, 08 Jan 2015 14:57:02 -0800
Message-ID: <1420757822.2842.39.camel@haakon3.risingtidesystems.com>
References: <54AD5DDD.2090808@dev.mellanox.co.il>
 <54AD6563.4040603@suse.de>
 <54ADA777.6090801@cs.wisc.edu>
 <54AE36CE.8020509@acm.org>
 <1420755361.2842.16.camel@haakon3.risingtidesystems.com>
 <1420756142.11310.9.camel@HansenPartnership.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To: <1420756142.11310.9.camel@HansenPartnership.com>
Sender: target-devel-owner@vger.kernel.org
To: James Bottomley
Cc: Bart Van Assche, open-iscsi@googlegroups.com, Hannes Reinecke,
 Sagi Grimberg, lsf-pc@lists.linux-foundation.org, linux-scsi, target-devel
List-Id: linux-scsi@vger.kernel.org

On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
> On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:
> > On Thu, 2015-01-08 at 08:50 +0100, Bart Van Assche wrote:
> > > On 01/07/15 22:39, Mike Christie wrote:
> > > > On 01/07/2015 10:57 AM, Hannes Reinecke wrote:
> > > >> On 01/07/2015 05:25 PM, Sagi Grimberg wrote:
> > > >>> Hi everyone,
> > > >>>
> > > >>> Now that scsi-mq is fully included, we need an iSCSI initiator that
> > > >>> would use it to achieve scalable performance. The need is even
> > > >>> greater for iSCSI offload devices and transports that support
> > > >>> multiple HW queues. As iSER maintainer I'd like to discuss the way
> > > >>> we would choose to implement that in iSCSI.
> > > >>>
> > > >>> My measurements show that the iSER initiator can scale up to ~2.1M
> > > >>> IOPs with multiple sessions, but only ~630K IOPs with a single
> > > >>> session, where the most significant bottleneck is the (single) core
> > > >>> processing completions.
> > > >>>
> > > >>> In the existing single-connection-per-session model, given that
> > > >>> command ordering must be preserved session-wide, we end up with
> > > >>> serial command execution over a single connection, which is
> > > >>> basically a single-queue model. The best fit seems to be plugging
> > > >>> iSCSI MCS in as a multi-queued scsi LLDD. In this model, a hardware
> > > >>> context will have a 1x1 mapping with an iSCSI connection (TCP
> > > >>> socket or a HW queue).
> > > >>>
> > > >>> iSCSI MCS and its role in the presence of the dm-multipath layer
> > > >>> was discussed several times in the past decade(s). The basic need
> > > >>> for MCS is implementing a multi-queue data path, so perhaps we may
> > > >>> want to avoid doing any kind of link aggregation or load balancing
> > > >>> so as not to overlap dm-multipath. For example we can implement
> > > >>> ERL=0 (which is basically the scsi-mq ERL) and/or restrict a
> > > >>> session to a single portal.
> > > >>>
> > > >>> As I see it, the todo's are:
> > > >>> 1. Getting MCS to work (kernel + user-space) with ERL=0 and a
> > > >>>    round-robin connection selection (per scsi command execution).
> > > >>> 2. Plug into scsi-mq - exposing num_connections as nr_hw_queues and
> > > >>>    using blk-mq based queue (conn) selection.
> > > >>> 3. Rework the iSCSI core locking scheme to avoid session-wide
> > > >>>    locking as much as possible.
> > > >>> 4. Use blk-mq pre-allocation and tagging facilities.
> > > >>>
> > > >>> I've recently started looking into this.
> > > >>> I would like the community to agree (or debate) on this scheme and
> > > >>> also talk about implementation with anyone who is also interested
> > > >>> in this.
> > > >>>
> > > >> Yes, that's a really good topic.
> > > >>
> > > >> I've pondered implementing MC/S for iscsi/TCP but then I figured my
> > > >> network implementation knowledge doesn't stretch that far.
> > > >> So yeah, a discussion here would be good.
> > > >>
> > > >> Mike? Any comments?
> > > >
> > > > I have been working under the assumption that people would be ok with
> > > > MCS upstream if we are only using it to handle the issue where we want
> > > > to do something like have a tcp/iscsi connection per CPU, then map the
> > > > connection to a blk_mq_hw_ctx. In this more limited MCS implementation
> > > > there would be no iscsi layer code to do something like load balance
> > > > across ports or transport paths the way dm-multipath does, so there
> > > > would be no feature/code duplication. For balancing across hctxs, the
> > > > iscsi layer would also leave that up to whatever we end up with in the
> > > > upper layers, so again no feature/code duplication with upper layers.
> > > >
> > > > So pretty non-controversial, I hope :)
> > > >
> > > > If people want to add something like round-robin connection selection
> > > > in the iscsi layer, then I think we want to leave that for after the
> > > > initial merge, so people can argue about that separately.
> > >
> > > Hello Sagi and Mike,
> > >
> > > I agree with Sagi that adding scsi-mq support in the iSER initiator
> > > would help iSER users, because it would allow them to configure a
> > > single iSER target and use the multiqueue feature instead of having to
> > > configure multiple iSER targets to spread the workload over multiple
> > > cpus at the target side.
> > >
> > > And I agree with Mike that implementing scsi-mq support in the iSER
> > > initiator as multiple independent connections is probably a better
> > > choice than MC/S. RFC 3720, after all, requires that iSCSI command
> > > numbering is session-wide. This means maintaining a single counter for
> > > all connections in an MC/S session. Such a counter would be a
> > > contention point. I'm afraid that because of that counter, performance
> > > on a multi-socket initiator system with a scsi-mq implementation based
> > > on MC/S could be worse than with the multiple-iSER-target approach.
> > > Hence my preference for an approach based on multiple independent iSER
> > > connections instead of MC/S.
> >
> > The idea that a simple session-wide counter for command sequence number
> > assignment adds such a degree of contention that it renders MC/S at a
> > performance disadvantage vs. multi-session configurations, with all of
> > the extra multipath logic overhead on top, is at best a naive
> > proposition.
> >
> > On the initiator side for MC/S, literally the only thing that needs to
> > be serialized is the assignment of the command sequence number to
> > individual non-immediate PDUs. The sending of the outgoing PDUs +
> > immediate data by the initiator can happen out-of-order, and it's up to
> > the target to ensure that the submission of the commands to the device
> > server is in command sequence number order.
> >
> > All of the actual immediate data + R2T -> data-out processing by the
> > target can be done out-of-order as well.
>
> Right, but what he's saying is that we've taken great pains in the MQ
> situation to free our issue queues of all entanglements and cross-queue
> locking so they can fly as fast as possible. If we have to assign an
> in-order sequence number across all the queues, this becomes both a
> cross-CPU bus lock point to ensure atomicity and a sync point to ensure
> sequencing. Naïvely that does look to be a bottleneck which wouldn't
> necessarily be mitigated simply by allowing everything to proceed out of
> order after this point.
>

The point is that a simple session-wide counter for command sequence
number assignment is significantly less overhead than all of the
overhead associated with running a full multipath stack atop multiple
sessions.
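
To make that concrete, here is a minimal userspace sketch of the kind of
per-session CmdSN assignment being discussed (hypothetical names, C11
atomics as a stand-in; this is not actual libiscsi code). The
fetch-and-add on the session-wide counter is the only cross-connection
serialization point; building and sending the PDU stays per-connection:

/* cmdsn_sketch.c - hypothetical illustration, not libiscsi code */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct sketch_session {
        _Atomic uint32_t cmdsn;         /* session-wide CmdSN counter */
};

struct sketch_conn {
        struct sketch_session *session;
        int cid;                        /* connection id == blk-mq hw queue */
};

/*
 * The only cross-connection serialization point: one fetch-and-add on
 * the session-wide counter per non-immediate PDU.
 */
static uint32_t sketch_assign_cmdsn(struct sketch_session *sess)
{
        return atomic_fetch_add(&sess->cmdsn, 1);
}

/*
 * Everything else stays per-connection: the PDU is built and sent on
 * this connection's socket / HW queue with no shared state, and
 * transmission order across connections does not matter because the
 * target re-orders execution by CmdSN.
 */
static void sketch_queuecommand(struct sketch_conn *conn, const char *cdb)
{
        uint32_t sn = sketch_assign_cmdsn(conn->session);

        printf("conn %d sends CmdSN %u for %s\n", conn->cid, sn, cdb);
}

int main(void)
{
        struct sketch_session sess = { .cmdsn = 1 };
        struct sketch_conn c0 = { &sess, 0 }, c1 = { &sess, 1 };

        sketch_queuecommand(&c0, "READ_10");
        sketch_queuecommand(&c1, "WRITE_10");
        return 0;
}

Whether that counter ends up as an atomic or as a plain increment under
an existing per-session lock is an implementation detail; either way it
is one word of shared state per command.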
Not to mention that our iSCSI/iSER initiator is already taking a
session-wide lock when sending outgoing PDUs, so adding a session-wide
counter isn't adding any additional synchronization overhead vs. what's
already in place.
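
Again as a rough sketch only (hypothetical names, a pthread mutex
standing in for the existing session-wide lock; not the actual libiscsi
queuecommand path), the CmdSN assignment simply rides inside the
critical section the initiator already pays for today:

/* lock_sketch.c - hypothetical fragment, not libiscsi code */
#include <pthread.h>
#include <stdint.h>

struct locked_session {
        pthread_mutex_t lock;   /* stand-in for the existing session-wide
                                 * lock held while queueing outgoing PDUs */
        uint32_t cmdsn;
};

/*
 * The counter increment happens inside the critical section that is
 * already taken on every command today, so MC/S adds one increment
 * here, not a new lock round-trip.
 */
static uint32_t sketch_queue_pdu(struct locked_session *sess)
{
        uint32_t sn;

        pthread_mutex_lock(&sess->lock);        /* pre-existing lock */
        sn = sess->cmdsn++;                     /* the "new" MC/S cost */
        /* ... add the PDU to the chosen connection's xmit queue ... */
        pthread_mutex_unlock(&sess->lock);

        return sn;
}

--nab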