From: "Oleg V. Ukhno"
To: John Fastabend
Cc: Jay Vosburgh, "netdev@vger.kernel.org", "David S. Miller"
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
Date: Tue, 18 Jan 2011 15:40:39 +0300
Message-ID: <4D358A47.4020009@yandex-team.ru>
In-Reply-To: <4D35060D.5080004@intel.com>
References: <20110114190714.GA11655@yandex-team.ru> <17405.1295036019@death> <4D30D37B.6090908@yandex-team.ru> <26330.1295049912@death> <4D35060D.5080004@intel.com>

On 01/18/2011 06:16 AM, John Fastabend wrote:
> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>> Can somebody (John?) more knowledgeable than I about dm-multipath
>> comment on the above?
>
> Here I'll give it a go.
>
> I don't think detecting L2 link failure this way is very robust. If
> there is a failure farther away than your immediate link, you're going
> to break completely? Your bonding hash will continue to round-robin
> the iSCSI packets and half of them will get dropped on the floor.
> dm-multipath handles this reasonably gracefully. Also, in this bonding
> environment you seem to be very sensitive to RTT times on the network.
> Maybe not outright bad, but I wouldn't consider this robust either.

John, I agree - this bonding mode is useful only in a fairly limited set
of situations. But as for a failure farther away than the immediate
link: every bonding mode suffers the same problem there, because bonding
detects only L2 failures; everything beyond that is left to upper-layer
mechanisms. Almost all bonding modes also assume equal RTT on the
slaves. And there is already a similar load-balancing mode, balance-alb;
what I did is approximately the same, but for the 802.3ad bonding mode,
and it provides "better" (more equal and unconditional, layer 2) load
striping for TX and, most importantly, RX.

Perhaps I shouldn't have mentioned the particular use case of this
patch. When I wrote it I tried to make a more general solution; my goal
was to get equal or near-equal load striping for TX and (the most
important part) RX within a single Ethernet (layer 2) domain for TCP
traffic. This bonding mode simply adds the ability to stripe the RX and
TX load of a single TCP connection between hosts inside one Ethernet
segment. iSCSI is just an example; it is equally possible to stripe load
between a Linux-based router and a Linux-based web/ftp/etc. server in
the same manner. I think this feature will be useful in a fair number of
network configurations.

Also, I looked into the net-next code - it seems to me the patch can be
adapted to the net-next bonding code without any difficulties, and the
change to the hashing function poses no problem there.
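To make the transmit side concrete, here is a minimal sketch of a
per-packet round-robin selection written in the shape of the existing
bonding xmit hash policies (which take the skb and the active slave
count and return a slave index). This is only an illustration of the
idea, not the submitted patch; the names bond_xmit_hash_policy_rr and
bond_rr_counter are invented for the example.

/* Illustrative sketch only, not the actual patch.  Instead of hashing
 * L2/L3/L4 header fields (which pins a single TCP flow to one slave),
 * return an incrementing counter so that consecutive frames of the
 * same connection are spread across all active slaves. */
#include <linux/skbuff.h>
#include <linux/atomic.h>

static atomic_t bond_rr_counter = ATOMIC_INIT(0);

static int bond_xmit_hash_policy_rr(struct sk_buff *skb, int count)
{
        /* count is the number of active slaves; counter % count picks
         * a different slave for each successive packet.  The cast
         * keeps the index non-negative after counter wrap-around. */
        return (unsigned int)atomic_inc_return(&bond_rr_counter) % count;
}

Of course the hash only covers the TX half; striping the RX half, as
described above, presumably also requires the peer host in the same
Ethernet segment to stripe its transmissions toward us in the same way.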
What I've written below is just my personal experience and opinion after
5 years of using Oracle + iSCSI + mpath (later - the patched bonding).

From my experience I can say that most iSCSI failures are caused by link
failures, and I would never send any significant iSCSI traffic through a
router - the router would be the bottleneck. So in my case iSCSI traffic
flows within one Ethernet domain, and on a link failure the bonding
driver simply fails one slave, instead of checking and failing hundreds
of paths as mpath does; the former is significantly less CPU-, network-
and time-consuming (at least with the default mpath checker,
readsector0). Mpath works well for me when I use it to "merge" DRBD
mirrors from different hosts, but for simple load striping across a
single L2 network switch between 2..16 hosts it is overkill
(particularly in maintaining human-readable device naming) :).

John, what is your opinion on such a load-balancing method in general,
without referring to particular use cases?

>
> You could tweak your scsi timeout values and fail_fast values, set the
> io retry to 0 to cause the fail over to occur faster. I suspect you
> already did this and still it is too slow? Maybe adding a checker in
> multipathd to listen for link events would be fast enough. The checker
> could then fail the path immediately.
>
> I'll try to address your comments from the other thread here. In
> general I wonder if it would be better to solve the problems in
> dm-multipath rather than add another bonding mode?

Of course I did this, but mpath is fine when the device count is below
30-40 devices with two paths each; 150-200 devices with 2+ paths can
make life far more interesting :)

>
> OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize
> latency)
>
> The dm-multipath layer is adding latency? How much? If this is really
> true maybe it's best to address the real issue here and not avoid it
> by using the bonding layer.

I do not remember the exact numbers now, but switching one of my
databases to bonding about 2 years ago increased read throughput for the
entire DB from 15-20 TB/day to approximately 30-35 TB/day (4 iSCSI
initiators and 8 iSCSI targets, 4 Ethernet links for iSCSI on each host,
all plugged into one switch), because the full bandwidth could finally
be used. Bonding also simplifies network and application setup greatly
compared to mpath.

>
> OVU - it handles any link failures badly, because of its command queue
> limitation (all queued commands above 32 are discarded in case of path
> failure, as I remember)
>
> Maybe true, but only link failures with the immediate peer are handled
> with a bonding strategy. By working at the block layer we can detect
> failures throughout the path. I would need to look into this again; I
> know when we were looking at this some time ago there was some talk
> about improving this behavior. I need to take some time to go back
> through the error recovery stuff to remember how this works.
>
> OVU - it performs very badly when there are many devices and many
> paths (I was unable to utilize more than 2 Gbps out of 4, even with
> 100 disks with 4 paths per disk)

Well, I think that behavior can be explained this way: when balancing by
the number of I/Os per path (rr_min_io) with a huge number of devices,
mpath does its load balancing per device. Equal use of all devices
cannot be guaranteed, so the load across the network interfaces becomes
imbalanced (mpath is unaware of their existence), and it likely gets
more imbalanced as the number of devices grows. Also, counting I/Os for
many devices and paths consumes some CPU and can cause excessive context
switches.
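For readers less familiar with the dm-multipath side of the comparison,
the knobs being discussed here (the path checker, fail-fast behaviour
and the per-path I/O count) live in /etc/multipath.conf. A rough sketch
of the fail-fast style tuning John suggests might look like the fragment
below; the values are illustrative only, not a recommendation, and would
have to be matched to the SCSI/iSCSI timeouts on the system.

# illustrative /etc/multipath.conf fragment - example values only
defaults {
        polling_interval     5            # seconds between path-checker runs
        path_checker         readsector0  # the default checker mentioned above
        no_path_retry        fail         # fail I/O at once instead of queueing
        rr_min_io            100          # I/Os down one path before switching
        failback             immediate
        path_grouping_policy multibus     # spread I/O across all paths
}

Even with this kind of tuning, though, the balancing stays per device
and is counted in I/Os, which is exactly the source of the per-interface
imbalance described above.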
>
> Hmm, well, that seems like something is broken. I'll try this setup
> when I get some time in the next few days. This really shouldn't be
> the case; dm-multipath should not add a bunch of extra latency or
> affect throughput significantly. By the way, what are you seeing
> without mpio?

And one more observation from my two-year-old tests: reading a device
(with dd; RHEL 5 Update 1 kernel, a ramdisk exported over iSCSI via
loopback) as an mpath device with a single path ran at approximately
120-150 MB/s, while the same test on the non-mpath device ran at
800-900 MB/s. Here I am quite sure; it was a kind of revelation to me
at the time.

>
> Thanks,
> John
>

-- 
Best regards,
Oleg Ukhno.
ITO Team Lead, Yandex LLC.