* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-26 22:16 Lincoln Dale
2003-03-26 22:56 ` Lars Marowsky-Bree
0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 22:16 UTC (permalink / raw)
To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel
At 10:09 AM 26/03/2003 -0600, Matt Mackall wrote:
> > >Indeed, there are iSCSI implementations that do multipath and
> > >failover.
> >
> > iSCSI is a transport.
> > logically, any "multipathing" and "failover" belongs in a layer above
> it --
> > typically as a block-layer function -- and not as a transport-layer
> > function.
> >
> > multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS,
> DevMapper
> > PowerPath, ...
>
>Funny then that I should be talking about Cisco's driver. :P
:-)
see my previous email to Jeff. iSCSI as a transport protocol does have a
muxing capability -- but its usefulness is somewhat limited (imho).
>iSCSI inherently has more interesting reconnect logic than other block
>devices, so it's fairly trivial to throw in recognition of identical
>devices discovered on two or more iSCSI targets..
what logic do you use to identify "identical devices"?
same data reported from SCSI Report_LUNs? or perhaps the same data
reported from a SCSI_Inquiry?
in reality, all multipathing software tends to use some blocks at the end
of the disk (just as most LVMs do).
for example, consider the following output from a set of two SCSI_Inquiry
and Report_LUNs on two paths to storage:
Lun Description Table
WWPN Lun Capacity Vendor Product Serial
---------------- ----- -------- ------------ ------------ ------
Path A:
21000004cf8c21fb 0 16GB HP 18.2G ST318452FC 3EV0BD8E
21000004cf8c21c5 0 16GB HP 18.2G ST318452FC 3EV0KHHP
50060e8000009591 0 50GB HITACHI DF500F DF500-00B
50060e8000009591 1 50GB HITACHI DF500F DF500-00B
50060e8000009591 2 50GB HITACHI DF500F DF500-00B
50060e8000009591 3 50GB HITACHI DF500F DF500-00B
Path B:
31000004cf8c21fb 0 16GB HP 18.2G ST318452FC 3EV0BD8E
31000004cf8c21c5 0 16GB HP 18.2G ST318452FC 3EV0KHHP
50060e8000009591 0 50GB HITACHI DF500F DF500-00A
50060e8000009591 1 50GB HITACHI DF500F DF500-00A
50060e8000009591 2 50GB HITACHI DF500F DF500-00A
50060e8000009591 3 50GB HITACHI DF500F DF500-00A
the "HP 18.2G" devices are 18G FC disks in an FC JBOD. each disk will
report an identical Serial # regardless of the interface/path used to get
to that device. no issues there right -- you can identify the disk as
being unique via its "Serial #" and can see the interface used to get to it
via its WWPN.
now, take a look at some disk from an intelligent disk array (in this case,
a HDS 9200).
it reports a _different_ serial number for the same disk, dependent on the
interface used. (DF500 is the model # of a HDS 9200, interfaces are
numbered 00A/00B/01A/01B).
does one now need to add logic into the kernel to provide some multipathing
for HDS disks?
does using linux mean that one has to change some settings on the HDS disk
array to get it to report different information via a SCSI_Inquiry? (it
can - but thats not the point - the point is that any multipathing software
out there just 'works' right now).
this is just one example. i could probably find another 50 of
slightly-different-behavior if you wanted me to!
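as an illustration, a userspace tool's naive serial-matching pass over the table above could look like this sketch (the helper and data layout are invented for the example; only the device data comes from the table):

```python
# hedged sketch: group candidate paths by the serial number a
# SCSI_Inquiry reports.  the helper itself is hypothetical.
from collections import defaultdict

def group_by_serial(paths):
    """Map serial -> list of (wwpn, lun) paths that report it."""
    groups = defaultdict(list)
    for wwpn, lun, serial in paths:
        groups[serial].append((wwpn, lun))
    return dict(groups)

# the FC JBOD disk reports one serial regardless of path:
jbod = [("21000004cf8c21fb", 0, "3EV0BD8E"),
        ("31000004cf8c21fb", 0, "3EV0BD8E")]
# the HDS array reports a per-interface serial for the same LUN:
hds = [("50060e8000009591", 0, "DF500-00B"),
       ("50060e8000009591", 0, "DF500-00A")]

print(len(group_by_serial(jbod)))  # one device, paths correctly matched
print(len(group_by_serial(hds)))   # two "devices": matching fails here
```

which is exactly the HDS problem above: serial-matching alone collapses the JBOD paths correctly but splits the array's single LUN in two.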
> > >Both iSCSI and ENBD currently have issues with pending writes during
> > >network outages. The current I/O layer fails to report failed writes
> > >to fsync and friends.
> >
> > these are not "iSCSI" or "ENBD" issues. these are issues with VFS.
>
>Except that the issue simply doesn't show up for anyone else, which is
>why it hasn't been fixed yet. Patches are in the works, but they need
>more testing:
>
>http://www.selenic.com/linux/write-error-propagation/
oh, but it does show up for other people. it may be that the issue doesn't
show up at fsync() time, but rather at close() time, or perhaps neither of
those!
code looks interesting. i'll take a look.
hmm, must find out a way to intentionally introduce errors now and see what
happens!
cheers,
lincoln.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 22:16 [PATCH] ENBD for 2.5.64 Lincoln Dale
@ 2003-03-26 22:56 ` Lars Marowsky-Bree
2003-03-26 23:21 ` Lincoln Dale
0 siblings, 1 reply; 27+ messages in thread
From: Lars Marowsky-Bree @ 2003-03-26 22:56 UTC (permalink / raw)
To: Lincoln Dale, Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel
On 2003-03-27T09:16:18,
Lincoln Dale <ltd@cisco.com> said:
> what logic do you use to identify "identical devices"?
> same data reported from SCSI Report_LUNs? or perhaps the same data
> reported from a SCSI_Inquiry?
That would work well.
We do parse device specific information in order to auto-configure the md
multipath at setup time. After that, magic is on disk...
> does one now need to add logic into the kernel to provide some multipathing
> for HDS disks?
Topology discovery is user-space! It does not need to live in the kernel.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
SuSE Labs - Research & Development, SuSE Linux AG
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
-- Capt. Edward A. Murphy -- Louis Pasteur
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 22:56 ` Lars Marowsky-Bree
@ 2003-03-26 23:21 ` Lincoln Dale
0 siblings, 0 replies; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 23:21 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: Matt Mackall, Jeff Garzik, ptb, Justin Cormack, linux kernel
Hi Lars,
At 11:56 PM 26/03/2003 +0100, Lars Marowsky-Bree wrote:
[..]
>We do parse device specific information in order to auto-configure the md
>multipath at setup time. After that, magic is on disk...
>
> > does one now need to add logic into the kernel to provide some multipathing
> > for HDS disks?
>
>Topology discovery is user-space! It does not need to live in the kernel.
i think we're agreeing on the same thing here!
yes, i believe topology discovery should only belong in userspace.
i believe it should be in userspace for both (a) setup and (b) at
kernel-boot-time.
likewise, i believe policy of deciding what mix of i/o's to put down
different paths also belongs in userspace.
this could take the form of a daemon that frequently looks up statistics
from the kernel (e.g. average latency per target), and uses that
information in conjunction with some 'policy' to tweak what paths are used.
but i definitely don't think that the kernel should make any wide-ranging
decisions about multiple paths, except for something like "deviceA has
disappeared. i know that deviceB is an alternate path, so will swing all
outstanding i/o plugged into A to B".
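such a userspace policy daemon might boil down to something like this hedged sketch (the stats source, path names and weighting rule are all invented for illustration):

```python
# hypothetical sketch of the policy daemon described above: take
# per-path statistics the kernel exposes (here, average latency per
# target) and weight future i/o toward the lower-latency path.

def choose_weights(latencies_ms):
    """Give each path a weight inversely proportional to its latency."""
    inv = {p: 1.0 / max(l, 0.001) for p, l in latencies_ms.items()}
    total = sum(inv.values())
    return {p: v / total for p, v in inv.items()}

stats = {"pathA": 2.0, "pathB": 8.0}   # ms, as read from the kernel
weights = choose_weights(stats)
print(weights["pathA"] > weights["pathB"])  # prefer the faster path
```

the kernel side then only needs a dumb "send this fraction of i/o down path X" knob, plus the deviceA-to-deviceB failover mentioned above.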
cheers,
lincoln.
* Re: [PATCH] ENBD for 2.5.64
2003-03-28 11:19 ` Pavel Machek
@ 2003-03-30 20:48 ` Peter T. Breuer
0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-30 20:48 UTC (permalink / raw)
To: Pavel Machek; +Cc: Peter T. Breuer, Justin Cormack, linux kernel
"A month of sundays ago Pavel Machek wrote:"
> Hi!
>
> > 9) it drops into a mode where it md5sums both ends and skips writes
> > of equal blocks, if that's faster. It moves in and out of this mode
> > automatically. This helps RAID resyncs (2* overspeed is common on
> > 100BT nets, that is 20MB/s.).
>
> Great way to find md5 collisions, I guess
> :-).
Don't worry, I'm not planning on claiming the Turing medal! Or living for
the lifetime of the universe .. :-(.
Peter
* Re: [PATCH] ENBD for 2.5.64
2003-03-25 20:53 ` Peter T. Breuer
2003-03-26 2:40 ` Jeff Garzik
[not found] ` <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>
@ 2003-03-28 11:19 ` Pavel Machek
2003-03-30 20:48 ` Peter T. Breuer
2 siblings, 1 reply; 27+ messages in thread
From: Pavel Machek @ 2003-03-28 11:19 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: Justin Cormack, linux kernel
Hi!
> 9) it drops into a mode where it md5sums both ends and skips writes
> of equal blocks, if that's faster. It moves in and out of this mode
> automatically. This helps RAID resyncs (2* overspeed is common on
> 100BT nets, that is 20MB/s.).
Great way to find md5 collisions, I guess
:-).
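(for reference, the skip-equal-blocks trick in point 9 amounts to something like this sketch; the names are invented, and a real implementation computes the remote digest on the server side rather than reading the block back:)

```python
# hedged sketch of the "md5 both ends, skip equal blocks" mode.
import hashlib

def maybe_write(local_block, remote_digest):
    """Return the block to send, or None if the remote copy matches."""
    if hashlib.md5(local_block).digest() == remote_digest:
        return None          # digests equal: skip the write
    return local_block       # digests differ: ship the whole block

block = b"\x00" * 4096
same = hashlib.md5(block).digest()
print(maybe_write(block, same) is None)            # equal: skipped
print(maybe_write(b"\xff" * 4096, same) is None)   # differs: sent
```

the win is that a 16-byte digest crosses the wire instead of a 4KB block whenever the two ends already agree, which is the common case during a RAID resync.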
Pavel
--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 23:49 ` Lincoln Dale
@ 2003-03-27 0:08 ` Peter T. Breuer
0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-27 0:08 UTC (permalink / raw)
To: Lincoln Dale; +Cc: ptb, Jeff Garzik, Matt Mackall, Justin Cormack, linux kernel
"Lincoln Dale wrote:"
> Hi Peter,
Hi!
> decent GE cards will do coalescing themselves anyway.
From what I confusedly remember of my last interchange with someone
convinced that packet coalescing (or lack of it, I forget which)
was the root of all evil, it's "all because" there's some magic limit
of 8K interrupts per second somewhere, and at 1.5KB per packet, that
would be only 12MB/s. So Ge cards wait after each interrupt to see if
there's some more stuff coming, so that they can treat more than one
packet at a time.
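that back-of-envelope limit, spelled out (the 8K and 1.5KB figures are the ones quoted above, not measured values):

```python
# un-coalesced ceiling: one 1.5KB frame handled per interrupt,
# with the alleged ~8K interrupts/sec limit.
INTERRUPTS_PER_SEC = 8_000
FRAME_BYTES = 1_500
print(INTERRUPTS_PER_SEC * FRAME_BYTES // 1_000_000)  # MB/s
```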
Apparently that means that if you have a two-way interchange in
your protocol at low level, they wait at the end of each half of
the protocol, even though you can't proceed with the protocol
until they decide to stop listening and start working. And the
result is a severe slowdown.
In my naive opinion, that should make ENBD's architecture (in which all
the channels going through the same NIC nevertheless work independently
and asynchronously) have an advantage, because pipelining effects
will fill up the slack time spaces in one channel's protocol with
activity from other channels.
But surely the number of channels required to fill up the waiting time
would be astronomical? Oh well.
Anyway, my head still spins.
The point is that none of this is as easy or straightforward as it
seems. I suspect that pure storage people like andre will make a real
mess of the networking considerations. It's just not easy.
Peter
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 22:02 ` Peter T. Breuer
@ 2003-03-26 23:49 ` Lincoln Dale
2003-03-27 0:08 ` Peter T. Breuer
0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 23:49 UTC (permalink / raw)
To: ptb; +Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel
Hi Peter,
At 11:02 PM 26/03/2003 +0100, Peter T. Breuer wrote:
>I'll content myself with mentioning that ENBD has /always/ throughout
>its five years of life had automatic failover between channels. Mind
>you, I don't think anybody makes use of the multichannel architecture in
>practice for the purposes of redundancy (i.e. people using multiple
>channels don't pass them through different interfaces or routes, which
>is the idea!), they may do it for speed/bandwidth.
>
>But then surely they might as well use channel bonding in the network layer?
>I've never tried it, or possibly never figured out how ..
"channel bonding" can handle cases whereby you lose a single NIC or port --
but typically channeling means that you need multiple paths into a single
ethernet switch.
single ethernet switch = single point of failure.
hence, from a high-availability (HA) perspective, you're better off
connecting N NICs into N switches -- and then load-balance (multipath)
across those.
an interesting side-note is that channel-bonding doesn't necessarily mean
higher performance.
i haven't looked at linux's channel-bonding, but many NICs on higher-end
servers offer this as an option, and when enabled, you end up with multiple
NICs with the same MAC address. typically only one NIC is used for one
direction of traffic.
> > the reason why goes back to how SCSI works. take an ethereal trace of
> iSCSI
> > and you'll see the way that 2 round-trips are used before any typical i/o
> > operation (read or write op) occurs.
>
>Hmm.
>I have some people telling me that I should pile up network packets
>in order to avoid too many interrupts firing on Ge cards, and other
>people telling me to send partial packets as soon as possible in order
>to avoid buffer buildup. My head spins.
:-)
most "storage" people care more about latency than they do about raw
performance. coalescing packets = bad for latency.
i figure there has to be middle ground somewhere -- implement both and have
it as a config option.
decent GE cards will do coalescing themselves anyway.
cheers,
lincoln.
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 23:03 ` Lincoln Dale
@ 2003-03-26 23:39 ` Andre Hedrick
0 siblings, 0 replies; 27+ messages in thread
From: Andre Hedrick @ 2003-03-26 23:39 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel
On Thu, 27 Mar 2003, Lincoln Dale wrote:
> > > to traditional SAN storage and you're gatewaying into Fibre Channel).
> >
> >Why a SAN gateway switch, they are all LAN limited.
>
> ?
> hmm, where to start:
>
> why a SAN gateway?
> because (a) that's what is out there right now, (b) iSCSI is really the
> enabler for people to connect to consolodated storage (that they already
> have) at a cheaper price-point than FC.
>
> LAN limited?
> 10GE is reality. so is etherchannel where you have 8xGE trunked
> together. "LAN is limited" is a rather bold statement that doesn't support
> the facts.
>
> in reality, most applications do NOT want to push 100mbyte/sec of i/o -- or
> even 20mbyte/sec.
> sure -- benchmarking programs do -- and i could show you a single host
> pushing 425mbyte/sec using 2 x 2gbit/s FC HBAs -- but in reality, thats way
> overkill for most people.
We agree this is even overkill for people like Pixar and the movie people.
> i know that your company is working on native iSCSI storage arrays;
> obviously its in your interests to talk about native iSCSI access to disks,
> but right now, i'll talk about how people deploy TB of storage today. this
> is most likely a different market segment to what you're working on (at
> least i hope you think it is) - but a discussion on those merits are not
> something that is useful in l-k.
Well we deploy ERL=1 or ERL=2 (80%) today on 6TB platforms.
So the democratization of SAN is now and today.
> > > handling multipathing in that manner is well beyond the scope of what an
> > > iSCSI driver in the kernel should be doing.
> > > determining the policy (read-preferred / write-preferred / round-robin /
> > > ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are
> > > used is most definitely something that should NEVER be in the kernel.
> >
> >Only "NEVER" if you are depending on classic bloated SAN
> >hardware/gateways. The very operations you are calling never, is done in
> >the gateways which is nothing more or less than an embedded system on
> >crack. So if this is an initiator which can manage sequencing streams, it
> >is far superior than dealing with the SAN traps of today.
>
> err, either you don't understand multipathing or i don't.
>
> "multipathing" is end-to-end between an initiator and a target.
> typically that initiator is a host and multipathing software is installed
> on that host.
> the target is typically a disk or disk-array. the disk array may have
> multiple controllers and those show up as multiple targets.
Agreed, and apply a series of head-to-toe target-initiator pairs and you
get multipathing support native from the super target. This is all a SAN
gateway/switch does. Not much more than LVM on crack and a six-pack.
> the thing about multipathing is that it doesn't depend on any magic in "SAN
> hardware/gateways" (sic) -- its simply a case of the host seeing the same
> disk via two interfaces and choosing to use one/both of those interfaces to
> talk to that disk.
Well Storage is nothing but a LIE, and regardless of whether one spoofs an
ident mode page or not, they must track and manage the resource reporting
properly.
> [..]
> >What do you have for real iSCSI and no FC junk not supporting
> >interoperability?
>
> ?
> no idea what you're talking about here.
Erm, shove a McDATA and Brocade switch on the same FC network and watch
it turn into a degraded dog.
> >FC is dying and nobody who has wasted money on FC junk will be interested
> >in iSCSI. They wasted piles of money and have to justify it.
>
> lets just agree to disagree. i don't hold that view.
Guess that is why NetAPP snaked a big share of EMC's marketspace with a
cheaper mousetrap. Agreed to "agree to disagree" erm whatever I just
typed.
> > > not bad for a single TCP stream and a software iSCSI stack. :-)
> > > (kernel is 2.4.20)
> >
> >Nice numbers, now do it over WAN.
>
> sustaining the throughput is simply a matter of:
> - having a large enough TCP window
> - ensuring all the TCP go-fast options are enabled
> - ensuring you can have a sufficient number of IO operations outstanding
> to allow SCSI to actually be able to fill the TCP window.
>
> realistically, yes, this can sustain high throughput across a WAN. but
> that WAN has to be built right in the first place.
Well sell more of those high bandwidth switches to the world of
internet-ether to make it faster, I would be happier.
> i.e. if its moving other traffic, provide QoS to allow storage traffic to
> have preference.
>
> >Sweet kicker here, if you only allow the current rules of SAN to apply.
> >This is what the big dogs want, and no new ideas allowed.
>
> i definitely don't subscribe to your conspiracy theories here. sorry.
You should listen to more Art Bell at night, well morning for you.
> >PS poking back at you for fun and serious points.
>
> yes - i figured. i'm happy to have a meaningful technical discussion, but
> don't have the cycles to discuss the universe.
I did the universe once as an academic, it was fun.
http://schwab.tsuniv.edu/t13.html
This was my last time of stargazing and I miss it too!
Cheers,
Andre Hedrick
LAD Storage Consulting Group
* Re: [PATCH] ENBD for 2.5.64
[not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide.org>
@ 2003-03-26 23:03 ` Lincoln Dale
2003-03-26 23:39 ` Andre Hedrick
0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 23:03 UTC (permalink / raw)
To: Andre Hedrick
Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel
Andre,
At 02:32 PM 26/03/2003 -0800, Andre Hedrick wrote:
> > in reality, if you had multiple TCP streams, its more likely you're doing
> > it for high-availability reasons (i.e. multipathing).
> > if you're multipathing, the chances are you want to multipath down two
> > separate paths to two different iSCSI gateways. (assuming you're talking
> > to traditional SAN storage and you're gatewaying into Fibre Channel).
>
>Why a SAN gateway switch, they are all LAN limited.
?
hmm, where to start:
why a SAN gateway?
because (a) that's what is out there right now, (b) iSCSI is really the
enabler for people to connect to consolidated storage (that they already
have) at a cheaper price-point than FC.
LAN limited?
10GE is reality. so is etherchannel where you have 8xGE trunked
together. "LAN is limited" is a rather bold statement that isn't supported
by the facts.
in reality, most applications do NOT want to push 100mbyte/sec of i/o -- or
even 20mbyte/sec.
sure -- benchmarking programs do -- and i could show you a single host
pushing 425mbyte/sec using 2 x 2gbit/s FC HBAs -- but in reality, thats way
overkill for most people.
i know that your company is working on native iSCSI storage arrays;
obviously its in your interests to talk about native iSCSI access to disks,
but right now, i'll talk about how people deploy TB of storage today. this
is most likely a different market segment to what you're working on (at
least i hope you think it is) - but a discussion on those merits are not
something that is useful in l-k.
> > handling multipathing in that manner is well beyond the scope of what an
> > iSCSI driver in the kernel should be doing.
> > determining the policy (read-preferred / write-preferred / round-robin /
> > ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are
> > used is most definitely something that should NEVER be in the kernel.
>
>Only "NEVER" if you are depending on classic bloated SAN
>hardware/gateways. The very operations you are calling never, is done in
>the gateways which is nothing more or less than an embedded system on
>crack. So if this is an initiator which can manage sequencing streams, it
>is far superior than dealing with the SAN traps of today.
err, either you don't understand multipathing or i don't.
"multipathing" is end-to-end between an initiator and a target.
typically that initiator is a host and multipathing software is installed
on that host.
the target is typically a disk or disk-array. the disk array may have
multiple controllers and those show up as multiple targets.
the thing about multipathing is that it doesn't depend on any magic in "SAN
hardware/gateways" (sic) -- its simply a case of the host seeing the same
disk via two interfaces and choosing to use one/both of those interfaces to
talk to that disk.
[..]
>What do you have for real iSCSI and no FC junk not supporting
>interoperability?
?
no idea what you're talking about here.
>FC is dying and nobody who has wasted money on FC junk will be interested
>in iSCSI. They wasted piles of money and have to justify it.
lets just agree to disagree. i don't hold that view.
> > not bad for a single TCP stream and a software iSCSI stack. :-)
> > (kernel is 2.4.20)
>
>Nice numbers, now do it over WAN.
sustaining the throughput is simply a matter of:
- having a large enough TCP window
- ensuring all the TCP go-fast options are enabled
- ensuring you can have a sufficient number of IO operations outstanding
to allow SCSI to actually be able to fill the TCP window.
realistically, yes, this can sustain high throughput across a WAN. but
that WAN has to be built right in the first place.
i.e. if its moving other traffic, provide QoS to allow storage traffic to
have preference.
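the first two items of that checklist are just socket options on the initiator side; a minimal sketch (the 4MB buffer size is an illustrative assumption, not a recommendation):

```python
# hedged sketch: grow the socket buffers (and hence the offered TCP
# window) and disable nagle so small SCSI PDUs aren't held back.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
s.close()
```

the third item (enough outstanding SCSI commands to fill the window) lives in the iSCSI layer's queue depth, not in the socket.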
>Sweet kicker here, if you only allow the current rules of SAN to apply.
>This is what the big dogs want, and no new ideas allowed.
i definitely don't subscribe to your conspiracy theories here. sorry.
>PS poking back at you for fun and serious points.
yes - i figured. i'm happy to have a meaningful technical discussion, but
don't have the cycles to discuss the universe.
cheers,
lincoln.
* Re: [PATCH] ENBD for 2.5.64
[not found] ` <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>
@ 2003-03-26 22:40 ` Matt Mackall
0 siblings, 0 replies; 27+ messages in thread
From: Matt Mackall @ 2003-03-26 22:40 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel
On Thu, Mar 27, 2003 at 09:10:14AM +1100, Lincoln Dale wrote:
> At 10:09 AM 26/03/2003 -0600, Matt Mackall wrote:
> >> >Indeed, there are iSCSI implementations that do multipath and
> >> >failover.
> >>
> >> iSCSI is a transport.
> >> logically, any "multipathing" and "failover" belongs in a layer above
> >it --
> >> typically as a block-layer function -- and not as a transport-layer
> >> function.
> >>
> >> multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS,
> >DevMapper
> >> PowerPath, ...
> >
> >Funny then that I should be talking about Cisco's driver. :P
>
> :-)
>
> see my previous email to Jeff. iSCSI as a transport protocol does have a
> muxing capability -- but its usefulness is somewhat limited (imho).
>
> >iSCSI inherently has more interesting reconnect logic than other block
> >devices, so it's fairly trivial to throw in recognition of identical
> >devices discovered on two or more iSCSI targets..
>
> what logic do you use to identify "identical devices"?
> same data reported from SCSI Report_LUNs? or perhaps the same data
> reported from a SCSI_Inquiry?
Sorry, can't remember.
> does one now need to add logic into the kernel to provide some multipathing
> for HDS disks?
No, most of it was done in userspace.
> >> >Both iSCSI and ENBD currently have issues with pending writes during
> >> >network outages. The current I/O layer fails to report failed writes
> >> >to fsync and friends.
> >>
> >> these are not "iSCSI" or "ENBD" issues. these are issues with VFS.
> >
> >Except that the issue simply doesn't show up for anyone else, which is
> >why it hasn't been fixed yet. Patches are in the works, but they need
> >more testing:
> >
> >http://www.selenic.com/linux/write-error-propagation/
>
> oh, but it does show up for other people. it may be that the issue doesn't
> show up at fsync() time, but rather at close() time, or perhaps neither of
> those!
Write errors basically don't happen for people who have attached
storage unless their drives die. Which is why the fact that the
pagecache completely ignores I/O errors has gone unnoticed for years..
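the failure mode can be seen in miniature with a hedged sketch (the path is arbitrary; on a healthy disk fsync simply succeeds, which is exactly why nobody noticed):

```python
# write() into the pagecache "succeeds" immediately; only fsync()
# (if error propagation works) gets a chance to report that the
# backing device later failed to complete the write.
import os

def write_and_sync(fd, data):
    os.write(fd, data)        # lands in the pagecache, returns happily
    try:
        os.fsync(fd)          # a failed write *should* surface here
        return None
    except OSError as e:
        return e.errno        # e.g. EIO, once propagation is in place

fd = os.open("/tmp/wep-demo", os.O_WRONLY | os.O_CREAT, 0o600)
print(write_and_sync(fd, b"hello") is None)   # healthy device: no error
os.close(fd)
```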
> code looks interesting. i'll take a look.
> hmm, must find out a way to intentionally introduce errors now and see what
> happens!
We stumbled on it by pulling cables to make failover happen.
--
Matt Mackall : http://www.selenic.com : of or relating to the moon
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 22:16 Lincoln Dale
@ 2003-03-26 22:32 ` Andre Hedrick
[not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide.org>
1 sibling, 0 replies; 27+ messages in thread
From: Andre Hedrick @ 2003-03-26 22:32 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel
On Thu, 27 Mar 2003, Lincoln Dale wrote:
> while the iSCSI spec has the concept of a "network portal" that can have
> multiple TCP streams for i/o, in the real world, i'm yet to see anything
> actually use those multiple streams.
Want a DEMO? It is called Sync-WAN-Raid-Relay.
> the reason why goes back to how SCSI works. take an ethereal trace of iSCSI
> and you'll see the way that 2 round-trips are used before any typical i/o
> operation (read or write op) occurs.
> multiple TCP streams for a given iSCSI session could potentially be used to
> achieve greater performance when the maximum-window-size of a single TCP
> stream is being hit.
> but its quite rare for this to happen.
>
> in reality, if you had multiple TCP streams, its more likely you're doing
> it for high-availability reasons (i.e. multipathing).
> if you're multipathing, the chances are you want to multipath down two
> separate paths to two different iSCSI gateways. (assuming you're talking
> to traditional SAN storage and you're gatewaying into Fibre Channel).
Why a SAN gateway switch, they are all LAN limited.
> handling multipathing in that manner is well beyond the scope of what an
> iSCSI driver in the kernel should be doing.
> determining the policy (read-preferred / write-preferred / round-robin /
> ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are
> used is most definitely something that should NEVER be in the kernel.
Only "NEVER" if you are depending on classic bloated SAN
hardware/gateways. The very operations you are calling never, is done in
the gateways which is nothing more or less than an embedded system on
crack. So if this is an initiator which can manage sequencing streams, it
is far superior than dealing with the SAN traps of today.
> btw, the performance of iSCSI over a single TCP stream is a moot point also.
> from a single host (IBM x335 Server i think?) communicating with a FC disk
> via an iSCSI gateway:
> mds# sh int gig2/1
> GigabitEthernet2/1 is up
> Hardware is GigabitEthernet, address is xxxx.xxxx.xxxx
> Internet address is xxx.xxx.xxx.xxx/24
> MTU 1500 bytes, BW 1000000 Kbit
> Port mode is IPS
> Speed is 1 Gbps
> Beacon is turned off
> 5 minutes input rate 21968640 bits/sec, 2746080 bytes/sec,
> 40420 frames/sec
> 5 minutes output rate 929091696 bits/sec, 116136462 bytes/sec,
> 80679 frames/sec
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 74228360 packets input, 13218256042 bytes
> 15409 multicast frames, 0 compressed
> 0 input errors, 0 frame, 0 overrun 0 fifo
> 169487726 packets output, 241066793565 bytes, 0 underruns
> 0 output errors, 0 collisions, 0 fifo
> 0 carrier errors
What do you have for real iSCSI and no FC junk not supporting
interoperability?
FC is dying and nobody who has wasted money on FC junk will be interested
in iSCSI. They wasted piles of money and have to justify it.
> not bad for a single TCP stream and a software iSCSI stack. :-)
> (kernel is 2.4.20)
Nice numbers, now do it over WAN.
> >>>Both iSCSI and ENBD currently have issues with pending writes during
> >>>network outages. The current I/O layer fails to report failed writes
> >>>to fsync and friends.
> >
> >...not if your iSCSI implementation is up to spec. ;-)
> >
> >>these are not "iSCSI" or "ENBD" issues. these are issues with VFS.
> >
> >VFS+VM. But, agreed.
>
> sure - the devil is in the details - but the issue holds true for
> traditional block devices at this point also.
Sweet kicker here, if you only allow the current rules of SAN to apply.
This is what the big dogs want, and no new ideas allowed.
Cheers,
Andre Hedrick
LAD Storage Consulting Group
PS poking back at you for fun and serious points.
* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-26 22:16 Lincoln Dale
2003-03-26 22:32 ` Andre Hedrick
[not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide.org>
0 siblings, 2 replies; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 22:16 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Matt Mackall, ptb, Justin Cormack, linux kernel
At 08:49 AM 26/03/2003 -0500, Jeff Garzik wrote:
>>>Indeed, there are iSCSI implementations that do multipath and
>>>failover.
>>
>>iSCSI is a transport.
>>logically, any "multipathing" and "failover" belongs in a layer above it
>>-- typically as a block-layer function -- and not as a transport-layer
>>function.
>>multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS,
>>DevMapper -- or in a commercial implementation such as Veritas VxDMP, HDS
>>HDLM, EMC PowerPath, ...
>
>I think you will find that most Linux kernel developers agree w/ you :)
>
>That said, iSCSI error recovery can be considered as supporting some of
>what multipathing and failover accomplish. iSCSI can be shoving bits
>through multiple TCP connections, or fail over from one TCP connection to
>another.
while the iSCSI spec has the concept of a "network portal" that can have
multiple TCP streams for i/o, in the real world, i'm yet to see anything
actually use those multiple streams.
the reason why goes back to how SCSI works. take an ethereal trace of iSCSI
and you'll see the way that 2 round-trips are used before any typical i/o
operation (read or write op) occurs.
multiple TCP streams for a given iSCSI session could potentially be used to
achieve greater performance when the maximum-window-size of a single TCP
stream is being hit.
but its quite rare for this to happen.
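the window limit is simple arithmetic: a single stream tops out at window/RTT, so extra streams only buy anything once that ceiling is actually hit (the figures below are illustrative, not measured):

```python
# single-stream throughput ceiling = window / round-trip time
WINDOW_BYTES = 64 * 1024      # classic 64KB maximum window, no scaling
RTT_SECONDS = 0.001           # 1ms LAN round-trip
print(int(WINDOW_BYTES / RTT_SECONDS / 1_000_000))  # MB/s ceiling
```

on a 1ms LAN even an unscaled 64KB window already outruns gigabit ethernet, which is why hitting the limit is rare in practice.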
in reality, if you had multiple TCP streams, its more likely you're doing
it for high-availability reasons (i.e. multipathing).
if you're multipathing, the chances are you want to multipath down two
separate paths to two different iSCSI gateways. (assuming you're talking
to traditional SAN storage and you're gatewaying into Fibre Channel).
handling multipathing in that manner is well beyond the scope of what an
iSCSI driver in the kernel should be doing.
determining the policy (read-preferred / write-preferred / round-robin /
ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are
used is most definitely something that should NEVER be in the kernel.
btw, the performance of iSCSI over a single TCP stream is a moot point also.
from a single host (IBM x335 Server i think?) communicating with a FC disk
via an iSCSI gateway:
mds# sh int gig2/1
GigabitEthernet2/1 is up
Hardware is GigabitEthernet, address is xxxx.xxxx.xxxx
Internet address is xxx.xxx.xxx.xxx/24
MTU 1500 bytes, BW 1000000 Kbit
Port mode is IPS
Speed is 1 Gbps
Beacon is turned off
5 minutes input rate 21968640 bits/sec, 2746080 bytes/sec,
40420 frames/sec
5 minutes output rate 929091696 bits/sec, 116136462 bytes/sec,
80679 frames/sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
74228360 packets input, 13218256042 bytes
15409 multicast frames, 0 compressed
0 input errors, 0 frame, 0 overrun 0 fifo
169487726 packets output, 241066793565 bytes, 0 underruns
0 output errors, 0 collisions, 0 fifo
0 carrier errors
not bad for a single TCP stream and a software iSCSI stack. :-)
(kernel is 2.4.20)
>>>Both iSCSI and ENBD currently have issues with pending writes during
>>>network outages. The current I/O layer fails to report failed writes
>>>to fsync and friends.
>
>...not if your iSCSI implementation is up to spec. ;-)
>
>>these are not "iSCSI" or "ENBD" issues. these are issues with VFS.
>
>VFS+VM. But, agreed.
sure - the devil is in the details - but the issue holds true for
traditional block devices at this point also.
cheers,
lincoln.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] ENBD for 2.5.64
[not found] <5.1.0.14.2.20030327083757.037c0760@mira-sjcm-3.cisco.com>
@ 2003-03-26 22:02 ` Peter T. Breuer
2003-03-26 23:49 ` Lincoln Dale
0 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-26 22:02 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel
"Lincoln Dale wrote:"
> >what multipathing and failover accomplish. iSCSI can be shoving bits
> >through multiple TCP connections, or fail over from one TCP connection to
> >another.
>
> while the iSCSI spec has the concept of a "network portal" that can have
> multiple TCP streams for i/o, in the real world, i'm yet to see anything
> actually use those multiple streams.
I'll content myself with mentioning that ENBD has /always/ throughout
its five years of life had automatic failover between channels. Mind
you, I don't think anybody makes use of the multichannel architecture in
practice for the purposes of redundancy (i.e. people using multiple
channels don't pass them through different interfaces or routes, which
is the idea!), they may do it for speed/bandwidth.
But then surely they might as well use channel bonding in the network layer?
I've never tried it, or possibly never figured out how ..
> the reason why goes back to how SCSI works. take a ethereal trace of iSCSI
> and you'll see the way that 2 round-trips are used before any typical i/o
> operation (read or write op) occurs.
Hmm.
I have some people telling me that I should pile up network packets
in order to avoid too many interrupts firing on Ge cards, and other
people telling me to send partial packets as soon as possible in order
to avoid buffer buildup. My head spins.
> multiple TCP streams for a given iSCSI session could potentially be used to
> achieve greater performance when the maximum-window-size of a single TCP
> stream is being hit.
> but its quite rare for this to happen.
My considered opinion is that there are way too many variables here for
anyone to make sense of them.
> in reality, if you had multiple TCP streams, its more likely you're doing
> it for high-availability reasons (i.e. multipathing).
Except that in real life most people don't know what they're doing and
they certainly don't know why they're doing it! In particular they
don't seem to get the idea that more redundancy is what they want.
I can almost see why.
But they can be persuaded to run multichannel by being promised more
speed.
> if you're multipathing, the chances are you want to multipath down two
> separate paths to two different iSCSI gateways. (assuming you're talking
> to traditional SAN storage and you're gatewaying into Fibre Channel).
Yes. This is all that really makes sense for redundancy. And make sure
the routing is distinct too.
Then you start having problems maintaining request order across
multiple paths. At least I do. But ENBD does it.
> determining the policy (read-preferred / write-preferred / round-robin /
> ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are
> used is most definitely something that should NEVER be in the kernel.
ENBD doesn't have any problem - it uses all the channels, by demand.
Each userspace daemon runs a different channel and each daemon picks
up requests to treat as soon as it can, as soon as there are any. The
kernel does not dictate. It's async.
(iscsi stream over tcp)
> 5 minutes output rate 929091696 bits/sec, 116136462 bytes/sec,
> 80679 frames/sec
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Very impressive. I think the most that's been seen over ENBD is 60MB/s
sustained, across Ge.
> not bad for a single TCP stream and a software iSCSI stack. :-)
> (kernel is 2.4.20)
Ditto.
Peter
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 7:31 ` Lincoln Dale
2003-03-26 9:59 ` Lars Marowsky-Bree
2003-03-26 13:49 ` Jeff Garzik
@ 2003-03-26 16:09 ` Matt Mackall
2 siblings, 0 replies; 27+ messages in thread
From: Matt Mackall @ 2003-03-26 16:09 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel
On Wed, Mar 26, 2003 at 06:31:31PM +1100, Lincoln Dale wrote:
> At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
> >> Yeah, iSCSI handles all that and more. It's a behemoth of a
> >> specification. (whether a particular implementation implements all that
> >> stuff correctly is another matter...)
> >
> >Indeed, there are iSCSI implementations that do multipath and
> >failover.
>
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it --
> typically as a block-layer function -- and not as a transport-layer
> function.
>
> multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, DevMapper
> PowerPath, ...
Funny then that I should be talking about Cisco's driver. :P
iSCSI inherently has more interesting reconnect logic than other block
devices, so it's fairly trivial to throw in recognition of identical
devices discovered on two or more iSCSI targets..
> >Both iSCSI and ENBD currently have issues with pending writes during
> >network outages. The current I/O layer fails to report failed writes
> >to fsync and friends.
>
> these are not "iSCSI" or "ENBD" issues. these are issues with VFS.
Except that the issue simply doesn't show up for anyone else, which is
why it hasn't been fixed yet. Patches are in the works, but they need
more testing:
http://www.selenic.com/linux/write-error-propagation/
--
Matt Mackall : http://www.selenic.com : of or relating to the moon
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 6:59 ` Andre Hedrick
@ 2003-03-26 13:58 ` Jeff Garzik
0 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-03-26 13:58 UTC (permalink / raw)
To: Andre Hedrick; +Cc: Matt Mackall, ptb, Justin Cormack, linux kernel
Andre Hedrick wrote:
> We have almost finalized our initiator to be submitted under OSL/GPL.
> This will be a full RFC ERL=2 w/ Sync-n-Steering.
That's pretty good news.
Also, on a tangent, I'll mention that I have been won over WRT OSL: with its
tighter "lawyerspeak" and mutual patent defense clauses, I consider
OSL to be a "better GPL" license.
I would in fact love to see the Linux kernel relicensed under OSL. I
think that would close some "holes" that exist with the GPL, and give us
better legal standing. But relicensing the kernel would be a huge
political undertaking, and I sure as hell don't have the energy, even if
it were possible. No idea if Linus, Alan, Andrew, or any of the other major
contributors would go for it, either.
Jeff, the radical
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 7:31 ` Lincoln Dale
2003-03-26 9:59 ` Lars Marowsky-Bree
@ 2003-03-26 13:49 ` Jeff Garzik
2003-03-26 16:09 ` Matt Mackall
2 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-03-26 13:49 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Matt Mackall, ptb, Justin Cormack, linux kernel
Lincoln Dale wrote:
> At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
>
>> > Yeah, iSCSI handles all that and more. It's a behemoth of a
>> > specification. (whether a particular implementation implements all
>> that
>> > stuff correctly is another matter...)
>>
>> Indeed, there are iSCSI implementations that do multipath and
>> failover.
>
>
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it
> -- typically as a block-layer function -- and not as a transport-layer
> function.
>
> multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS,
> DevMapper -- or in a commercial implementation such as Veritas VxDMP,
> HDS HDLM, EMC PowerPath, ...
I think you will find that most Linux kernel developers agree w/ you :)
That said, iSCSI error recovery can be considered as supporting some of
what multipathing and failover accomplish. iSCSI can be shoving bits
through multiple TCP connections, or fail over from one TCP connection
to another.
>> Both iSCSI and ENBD currently have issues with pending writes during
>> network outages. The current I/O layer fails to report failed writes
>> to fsync and friends.
...not if your iSCSI implementation is up to spec. ;-)
> these are not "iSCSI" or "ENBD" issues. these are issues with VFS.
VFS+VM. But, agreed.
Jeff
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 9:59 ` Lars Marowsky-Bree
@ 2003-03-26 10:18 ` Andrew Morton
0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-03-26 10:18 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: linux-kernel
Lars Marowsky-Bree <lmb@suse.de> wrote:
>
> In particular with ENBD, a partial write could occur at the block device
> layer. Now try to report that upwards to the write() call. Good luck.
Well you can't, unless it is an O_SYNC or O_DIRECT write...
But yes, for a normal old write() followed by an fsync() the IO error can be
lost. We'll fix this for 2.6. I have oxymoron's patches lined up, but they
need a couple of quality hours' worth of thinking about yet.
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 7:31 ` Lincoln Dale
@ 2003-03-26 9:59 ` Lars Marowsky-Bree
2003-03-26 10:18 ` Andrew Morton
2003-03-26 13:49 ` Jeff Garzik
2003-03-26 16:09 ` Matt Mackall
2 siblings, 1 reply; 27+ messages in thread
From: Lars Marowsky-Bree @ 2003-03-26 9:59 UTC (permalink / raw)
To: linux kernel
On 2003-03-26T18:31:31,
Lincoln Dale <ltd@cisco.com> said:
> >Indeed, there are iSCSI implementations that do multipath and
> >failover.
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it --
"Multipathing" on iSCSI is actually a layer below - network resiliency is
handled by routing protocols, the switching fabric etc.
> >Both iSCSI and ENBD currently have issues with pending writes during
> >network outages. The current I/O layer fails to report failed writes
> >to fsync and friends.
> these are not "iSCSI" or "ENBD" issues. these are issues with VFS.
Yes, and it is a fairly annoying issue... In particular with ENBD, a partial
write could occur at the block device layer. Now try to report that upwards to
the write() call. Good luck.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
SuSE Labs - Research & Development, SuSE Linux AG
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
-- Capt. Edward A. Murphy -- Louis Pasteur
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 5:55 ` Matt Mackall
2003-03-26 6:31 ` Peter T. Breuer
2003-03-26 6:59 ` Andre Hedrick
@ 2003-03-26 7:31 ` Lincoln Dale
2003-03-26 9:59 ` Lars Marowsky-Bree
` (2 more replies)
2 siblings, 3 replies; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 7:31 UTC (permalink / raw)
To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel
At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
> > Yeah, iSCSI handles all that and more. It's a behemoth of a
> > specification. (whether a particular implementation implements all that
> > stuff correctly is another matter...)
>
>Indeed, there are iSCSI implementations that do multipath and
>failover.
iSCSI is a transport.
logically, any "multipathing" and "failover" belongs in a layer above it --
typically as a block-layer function -- and not as a transport-layer function.
multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, DevMapper
-- or in a commercial implementation such as Veritas VxDMP, HDS HDLM, EMC
PowerPath, ...
>Both iSCSI and ENBD currently have issues with pending writes during
>network outages. The current I/O layer fails to report failed writes
>to fsync and friends.
these are not "iSCSI" or "ENBD" issues. these are issues with VFS.
cheers,
lincoln.
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 6:48 ` Matt Mackall
@ 2003-03-26 7:05 ` Peter T. Breuer
0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-26 7:05 UTC (permalink / raw)
To: Matt Mackall; +Cc: Peter T. Breuer, Jeff Garzik, Justin Cormack, linux kernel
"A month of sundays ago Matt Mackall wrote:"
> > > Both iSCSI and ENBD currently have issues with pending writes during
> > > network outages. The current I/O layer fails to report failed writes
> > > to fsync and friends.
> >
> > ENBD has two (configurable) behaviors here. Perhaps it should have
> > more. By default it blocks pending reads and writes during times when
> > the connection is down. It can be configured to error them instead.
>
> And in this case, the upper layers will silently drop write errors on
> current kernels.
>
> Cisco's Linux iSCSI driver has a configurable timeout, defaulting to
> 'infinite', btw.
That corresponds to enbd's default behavior. Sigh. Guess I'll have to
make it 0-infty, instead of 0 or infty. It's easy enough - just need to
make it settable in proc (and think about which is the one line I need
to touch ...).
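Peter's proposed change from "0 or infty" to "0-infty" can be sketched as follows. This is a hypothetical model, not ENBD code: timeout 0 errors a blocked request immediately (the behavior soft RAID wants), None blocks until reconnect (the current default), and anything in between times out:

```python
import threading

class PendingIO:
    """One request held while the connection is down.

    timeout=0    -> error immediately (what soft RAID wants)
    timeout=None -> block until reconnect (the 'infty' default)
    timeout=t>0  -> block, then error after t seconds
    """
    def __init__(self, timeout):
        self.timeout = timeout
        self.reconnected = threading.Event()  # set when the link comes back

    def wait(self):
        if self.timeout == 0:
            return "EIO"
        if self.reconnected.wait(self.timeout):
            return "resubmit"   # link is back: resend the request
        return "EIO"            # timed out somewhere in (0, infinity)
```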
> Hrrmm. The potential to lose data by surprise here is not terribly
> appealing.
Making the driver "intelligent" is indeed bad news for the more
intelligent admin. I was thinking of making it default to 0 timeout if
it knows it's running under raid, but I have a natural antipathy to
such in-driver decisions. My conscience would be slightly less on
alert if the userspace daemon did the decision-making. I suppose it
could.
> It might be better to add an accounting mechanism to say
> "never go above x dirty pages against block device n" or something of
> the sort but you can still get into trouble if you happen to have
> hundreds of iSCSI devices each with their own request queue..
Well, you can get in trouble if you allow even a single dirty page to
be outstanding to something, and have thousands of those somethings.
That's not the normal situation, however, whereas it is normal to
have a single network device and to be writing pell-mell to it
oblivious to the state of the device itself.
Peter
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 5:55 ` Matt Mackall
2003-03-26 6:31 ` Peter T. Breuer
@ 2003-03-26 6:59 ` Andre Hedrick
2003-03-26 13:58 ` Jeff Garzik
2003-03-26 7:31 ` Lincoln Dale
2 siblings, 1 reply; 27+ messages in thread
From: Andre Hedrick @ 2003-03-26 6:59 UTC (permalink / raw)
To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel
On Tue, 25 Mar 2003, Matt Mackall wrote:
> > Yeah, iSCSI handles all that and more. It's a behemoth of a
> > specification. (whether a particular implementation implements all that
> > stuff correctly is another matter...)
>
> Indeed, there are iSCSI implementations that do multipath and
> failover.
>
> Both iSCSI and ENBD currently have issues with pending writes during
> network outages. The current I/O layer fails to report failed writes
> to fsync and friends.
>
> > BTW, I'm a big enbd fan :) I like enbd for it's _simplicity_ compared
> > to iSCSI.
>
> Definitely. The iSCSI protocol is more powerful but _much_ more
> complex than ENBD. I've spent two years working on iSCSI but guess
> which I use at home..
To be totally fair, ENBD/NBD is not a SAN, nor will it ever become a
qualified SAN. I would like to see what happens to your data if you wrap
your ethernet around a ballast resistor, or run it near the associated
light fixture and blink the power/lights. This is where goofy people
run cables in the drop ceilings.
We have almost finalized our initiator to be submitted under OSL/GPL.
This will be a full RFC ERL=2 w/ Sync-n-Steering.
I have seen too much GPL code stolen and could do nothing about it.
While the code is not in binary executable form it shall be under OSL
only; only at compile and execution time will it become GPL compliant,
period. This is designed to extend the copyright holder's rights, to
force anyone who uses the code and changes anything to return those
changes to the copyright holder, period. Additional language may be
added to permit non-returned code to exist under extremely heavy
licensing fees, to be used to promote OSL projects and to assist any GPL
holder with litigation fees to protect their rights. Too many of us do
not have the means to defend our copyrights; this is one way I can see
to provide a plausible solution to a dirty problem. It may not be the
best answer, but it is doable, and it will put some teeth into the
license to stop this from happening again.
Like it or hate it, OSL/GPL looks to be the best match out there.
Regards,
Andre Hedrick, CTO & Founder
iSCSI Software Solutions Provider
http://www.PyXTechnologies.com/
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 6:31 ` Peter T. Breuer
@ 2003-03-26 6:48 ` Matt Mackall
2003-03-26 7:05 ` Peter T. Breuer
0 siblings, 1 reply; 27+ messages in thread
From: Matt Mackall @ 2003-03-26 6:48 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: Jeff Garzik, Justin Cormack, linux kernel
> > Both iSCSI and ENBD currently have issues with pending writes during
> > network outages. The current I/O layer fails to report failed writes
> > to fsync and friends.
>
> ENBD has two (configurable) behaviors here. Perhaps it should have
> more. By default it blocks pending reads and writes during times when
> the connection is down. It can be configured to error them instead.
And in this case, the upper layers will silently drop write errors on
current kernels.
Cisco's Linux iSCSI driver has a configurable timeout, defaulting to
'infinite', btw.
> What I would like is some way of telling how backed up the VM is
> against us. If the VM is full of dirty buffers aimed at us, then
> I think we should consider erroring instead of blocking. The problem is
> that at that point we're likely not getting any requests at all,
> because the kernel long ago ran out of the 256 requests it has in
> hand to send us.
Hrrmm. The potential to lose data by surprise here is not terribly
appealing. It might be better to add an accounting mechanism to say
"never go above x dirty pages against block device n" or something of
the sort but you can still get into trouble if you happen to have
hundreds of iSCSI devices each with their own request queue..
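The "never go above x dirty pages against block device n" accounting suggested here might look like the following toy model (purely hypothetical, not kernel code); its `total()` also illustrates the caveat that per-device caps still add up across hundreds of devices:

```python
class DirtyAccounting:
    """Cap the number of dirty pages outstanding against each device."""
    def __init__(self, cap):
        self.cap = cap
        self.dirty = {}  # device name -> pages awaiting writeback

    def try_dirty(self, dev, pages=1):
        used = self.dirty.get(dev, 0)
        if used + pages > self.cap:
            return False  # over the cap: caller must block or error
        self.dirty[dev] = used + pages
        return True

    def complete(self, dev, pages=1):
        # writeback finished (or failed definitively): release the budget
        self.dirty[dev] = max(0, self.dirty.get(dev, 0) - pages)

    def total(self):
        # the caveat: per-device caps still sum across many devices
        return sum(self.dirty.values())
```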
--
Matt Mackall : http://www.selenic.com : of or relating to the moon
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 5:55 ` Matt Mackall
@ 2003-03-26 6:31 ` Peter T. Breuer
2003-03-26 6:48 ` Matt Mackall
2003-03-26 6:59 ` Andre Hedrick
2003-03-26 7:31 ` Lincoln Dale
2 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-26 6:31 UTC (permalink / raw)
To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel
"A month of sundays ago Matt Mackall wrote:"
> On Tue, Mar 25, 2003 at 09:40:44PM -0500, Jeff Garzik wrote:
> > Peter T. Breuer wrote:
> > >"Justin Cormack wrote:"
> > >>And I am intending to write an iscsi client sometime, but it got
> > >>delayed. The server stuff is already available from 3com.
> > >
> > >Possibly, but ENBD is designed to fail :-). And networks fail.
> > >What will your iscsi implementation do when somebody resets the
> > >router? All those issues are handled by ENBD. ENBD breaks off and
> > >reconnects automatically. It reacts right to removable media.
> >
> > Yeah, iSCSI handles all that and more. It's a behemoth of a
> > specification. (whether a particular implementation implements all that
> > stuff correctly is another matter...)
>
> Indeed, there are iSCSI implementations that do multipath and
> failover.
Somebody really ought to explain it to me :-). I can't keep up with all
this!
> Both iSCSI and ENBD currently have issues with pending writes during
> network outages. The current I/O layer fails to report failed writes
> to fsync and friends.
ENBD has two (configurable) behaviors here. Perhaps it should have
more. By default it blocks pending reads and writes during times when
the connection is down. It can be configured to error them instead. The
erroring behavior is what you want when running under soft RAID, as
it's raid that should do the deciding about how to treat the requests
according to the overall state of the array, so it needs definite
yes/no info on each request, no "maybe".
Perhaps in a third mode requests should be blocked and time out after
about half an hour (or some number, in an infinite spectrum).
What I would like is some way of telling how backed up the VM is
against us. If the VM is full of dirty buffers aimed at us, then
I think we should consider erroring instead of blocking. The problem is
that at that point we're likely not getting any requests at all,
because the kernel long ago ran out of the 256 requests it has in
hand to send us.
There is indeed an information disconnect with the VM in those
circumstances that I've never known how to solve. FSs too are a
problem, because unless they are mounted sync they will happily permit
writes to a file on a fs on a blocked device, even if that fills the
machine with buffers that can't go anywhere. Among other things, that
will run tcp out of the buffer space that is needed to flush
those buffers when the connection does come back. And even if the
mount is sync then some fs's (e.g. ext2) still allow infinitely much
writing to a blocked device under some circumstances (start two
processes writing to the same file .. the second will write to
buffers).
Peter
* Re: [PATCH] ENBD for 2.5.64
2003-03-26 2:40 ` Jeff Garzik
@ 2003-03-26 5:55 ` Matt Mackall
2003-03-26 6:31 ` Peter T. Breuer
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Matt Mackall @ 2003-03-26 5:55 UTC (permalink / raw)
To: Jeff Garzik; +Cc: ptb, Justin Cormack, linux kernel
On Tue, Mar 25, 2003 at 09:40:44PM -0500, Jeff Garzik wrote:
> Peter T. Breuer wrote:
> >"Justin Cormack wrote:"
> >>And I am intending to write an iscsi client sometime, but it got
> >>delayed. The server stuff is already available from 3com.
> >
> >
> >Possibly, but ENBD is designed to fail :-). And networks fail.
> >What will your iscsi implementation do when somebody resets the
> >router? All those issues are handled by ENBD. ENBD breaks off and
> >reconnects automatically. It reacts right to removable media.
>
> Yeah, iSCSI handles all that and more. It's a behemoth of a
> specification. (whether a particular implementation implements all that
> stuff correctly is another matter...)
Indeed, there are iSCSI implementations that do multipath and
failover.
Both iSCSI and ENBD currently have issues with pending writes during
network outages. The current I/O layer fails to report failed writes
to fsync and friends.
> BTW, I'm a big enbd fan :) I like enbd for it's _simplicity_ compared
> to iSCSI.
Definitely. The iSCSI protocol is more powerful but _much_ more
complex than ENBD. I've spent two years working on iSCSI but guess
which I use at home..
--
Matt Mackall : http://www.selenic.com : of or relating to the moon
* Re: [PATCH] ENBD for 2.5.64
2003-03-25 20:53 ` Peter T. Breuer
@ 2003-03-26 2:40 ` Jeff Garzik
2003-03-26 5:55 ` Matt Mackall
[not found] ` <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>
2003-03-28 11:19 ` Pavel Machek
2 siblings, 1 reply; 27+ messages in thread
From: Jeff Garzik @ 2003-03-26 2:40 UTC (permalink / raw)
To: ptb; +Cc: Justin Cormack, linux kernel
Peter T. Breuer wrote:
> "Justin Cormack wrote:"
>>And I am intending to write an iscsi client sometime, but it got
>>delayed. The server stuff is already available from 3com.
>
>
> Possibly, but ENBD is designed to fail :-). And networks fail.
> What will your iscsi implementation do when somebody resets the
> router? All those issues are handled by ENBD. ENBD breaks off and
> reconnects automatically. It reacts right to removable media.
Yeah, iSCSI handles all that and more. It's a behemoth of a
specification. (whether a particular implementation implements all that
stuff correctly is another matter...)
BTW, I'm a big enbd fan :) I like enbd for it's _simplicity_ compared
to iSCSI.
Jeff
* Re: [PATCH] ENBD for 2.5.64
[not found] <1048623613.25914.14.camel@lotte>
@ 2003-03-25 20:53 ` Peter T. Breuer
2003-03-26 2:40 ` Jeff Garzik
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-25 20:53 UTC (permalink / raw)
To: Justin Cormack; +Cc: linux kernel
"Justin Cormack wrote:"
> On Tue, 2003-03-25 at 17:27, Peter T. Breuer wrote:
> > > ENBD is not a replacement for NBD - the two are alternatives, aimed
> > > at different niches. ENBD is a sort of heavyweight industrial NBD. It
> > > does many more things and has a different achitecture. Kernel NBD is
> > > like a stripped down version of ENBD. Both should be in the kernel.
>
> hmm, I would argue that nbd is ok, as it is a nice lightweight block
> device (though I have not been able to use it due to the fact that I can
> never find a userspace and kernel that work together), while ENBD should
> be replaced by iscsi, now that is a real ietf standard, and can burn CDs
> across the net and all that extra stuff.
It's not a bad idea. But ENBD in particular can use any transport,
precisely because its networking is done in userspace. One only has to
write a stream.c module for it that implements
read
write
open
close
(There are currently implementations for three transports, including
tcp of course).
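As an illustration of that interface (a toy sketch, not the actual stream.c), a transport is anything providing those four methods; here a loopback transport stands in for tcp or ssl:

```python
class LoopbackStream:
    """Toy transport: the four methods a new stream module must provide.
    A real module would wrap tcp, ssl, or a raw protocol instead."""
    def open(self, addr):
        self.buf = bytearray()
        self.connected = True

    def write(self, data):
        self.buf.extend(data)  # "send" by queuing locally
        return len(data)

    def read(self, n):
        out = bytes(self.buf[:n])  # "receive" what was queued
        del self.buf[:n]
        return out

    def close(self):
        self.connected = False
```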
> And I am intending to write an iscsi client sometime, but it got
> delayed. The server stuff is already available from 3com.
Possibly, but ENBD is designed to fail :-). And networks fail.
What will your iscsi implementation do when somebody resets the
router? All those issues are handled by ENBD. ENBD breaks off and
reconnects automatically. It reacts right to removable media.
I should also have said that ENBD has the following features (I said I
forgot some!)
9) it drops into a mode where it md5sums both ends and skips writes
of equal blocks, if that's faster. It moves in and out of this mode
automatically. This helps RAID resyncs (2* overspeed is common on
100BT nets, that is 20MB/s.).
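That skip-equal-writes mode can be sketched as follows (a hypothetical model, not ENBD's actual code): both ends hash a block, and the client only ships the data when the digests differ:

```python
import hashlib

def resync_block(local_block, remote_digest):
    """Client side of one block during a RAID resync: hash the local
    block and skip the write when the server reports the same digest."""
    if hashlib.md5(local_block).digest() == remote_digest:
        return "skipped"  # blocks already equal: no data on the wire
    return "written"      # digests differ: ship the block
```

Shipping a 16-byte digest instead of a whole block is how a resync of mostly-identical disks can run faster than the wire would otherwise allow.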
10) integration with RAID - it advises raid correctly of its state
and does hot add and remove correctly (well, you need my patches to
raid, but there you are ...).
Of course, if somebody wants me to make enbd appear like a scsi device
instead of a generic block device, I guess I could do that. Except
that, yecch, I have seen the scsi code, and I do not understand it.
Another good idea is to make the wire protocol iscsi compatible. I
have no objection to that.
Peter
* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-25 17:27 Peter T. Breuer
0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-25 17:27 UTC (permalink / raw)
To: linux kernel
"a little while ago ptb wrote:"
> Here's a patch to incorporate Enhanced NBD (ENBD) into kernel 2.5.64.
> I'll put the patch on the list first, and then post again with a
> technical breakdown and various arguments/explanations.
I'll now put up the technical discussion I promised. (the patch is
also in the patches/ subdir in the archive at
ftp://oboe.it.uc3m.es/pub/Programs/nbd-2.4.31.tgz)
I'll repeat the dates .. Pavel's kernel NBD, 1997, the ENBD 1998,
derived initially from Pavel's code backported to stable kernels.
Pavel and I have been in contact many times over the years.
Technical differences
---------------------
1) One of the original changes made was technical, and is perhaps
the biggest reason for what incompatibilities there are (I can
equalize the wire formats, but not the functional protocols, so you
need different userspace support for the different kernel drivers).
- kernel nbd runs a single thread transferring data between kernel
and net.
- ENBD runs multiple threads running asynchronously of each other
The result is that ENBD can get a pipelining benefit .. while one
thread is sending to the net another is talking to the kernel and
so on. This shows up in different ways. Obviously you do best if
you have two cpus or two nics, etc.
Also ENBD doesn't die when one thread gets stuck. I'll talk about
that.
2) There is a difference in philosophy, which results in different
code, different behaviors, etc. Basically, ENBD must not /fail/.
It's supposed to keep working first and foremost, and deal with
errors when they crop up, and it's supposed to expect errors.
- kernel nbd runs a full kernel thread which cannot die. It loops
inside the kernel.
- ENBD runs userspace threads which can die and are expected to die
and which are restarted by a master when they die. They only dip
into the kernel occasionally.
This originally arose because I was frustrated with not being able
to kill the kernel nbd client daemon, and thus free up its "space".
It certainly used to start what nowadays we know as a kernel thread,
but from user space. It dove into the kernel in an ioctl and
executed a forever loop there. ENBD doesn't do that. It runs the
daemon cycle from user space via separate ioctls for each stage.
That's why you need different user space utilities.
- kernel nbd has daemons which are quite lightweight
- ENBD has daemons which disconnect if they detect network failures
and reconnect as soon as the net comes up again. Servers and
clients can die, and be restarted, and they'll reconnect, entirely
automatically, all on their little ownsomes ..
ENBD is prepared internally to retract requests from client daemons
which don't respond any longer, and pass them to others instead.
It's therefore also prepared to receive acks out of order, etc. etc.
Another facet of all that is the following:
- kernel nbd does networking from within the kernel
- ENBD does its networking from userspace. It has to, to manage the
complex reconnect handshakes, authentication, brownouts, etc.
As a result, ENBD is much more flexible in its transport protocols.
There is a single code module which implements a "stream", and
the three or four methods within need to be reimplemented for each
protocol, but that's all. There are two standard transports in the
distribution code - tcp and ssl, and other transport modules have
been implemented, including ones for very low overhead raw networking
protocols.
OK, I can't think of any more "basic" things at the moment. But ENBD
also suffers from galloping featurism. All the features can be added to
kernel nbd too, of course, but some of them are not point changes at
all! It would take just as long as it took to add them to ENBD in the
first place. I'll make a list ...
Featuritis
----------
1) remote ioctls. ENBD does pass ioctls over the net to the
server. Only the ones it knows about, of course, but that's
at least a hundred of them. You can eject cdroms over the net.
More ioctls can be added to its list anytime. Well, it knows about
at least 4 different techniques for moving ioctls, and you can
invent more ..
2) support for removable media. Maybe I should have included that in
the technical differences part. Basically, ENBD expects the
server to report errors that are on-disk, and it distinguishes
them from on-wire errors. It proactively pings the server and
asks it to check its media, every second or so. A change
in an exported floppy is spotted, and the kernel notified.
3) ENBD has a "local write/remote read" mode, which is useful as
a replacement for NFS root. A single server can be replicated to
many clients, each of which then makes its own local changes.
The writes stay in memory, of course (this IS a kind of point
change).
4) ENBD has an async mode (well, two), in which no acks are expected
for requests. This is useful for swapping over ENBD (the daemons
also have to be fixed in memory for that, and that's a "-s" flag).
Really, there are several async modes. Either the client doesn't
need to ack the kernel, or can ack it late, or the server doesn't
need to ack the client, etc.
5) ENBD has an evolved accounting and control interface in /proc.
It amounts to about 25% of its code.
6) ENBD supports several sync modes, direct i/o on client, sync
on server, talking to raw devices, etc.
7) ENBD supports partitions.
Maybe there are more features. There are enough that I forget them at
times. I try and split them out into add-on modules. These are things
that have been requested or even requested and funded! So they satisfy
real needs.
Extra badness
-------------
One thing that's obvious is that ENBD has vastly more code than kernel
nbd. Look at these stats:
csize output, enbd vs kernel nbd ..
total blank lines w/ nb, nc semi- preproc. file
lines lines comments lines colons direct.
--------+--------+--------+--------+--------+--------+----
4172 619 800 2789 1438 89 enbd_base.c
405 38 67 304 70 38 enbd_ioctl.c
30 4 3 23 10 4 enbd_ioctl_stub.c
99 13 8 78 34 8 enbd_md.c
1059 134 32 902 447 15 enbd_proc.c
75 8 16 51 20 2 enbd_seqno.c
64 14 5 45 18 2 enbd_speed.c
5943 839 931 4222 2043 167 total
total blank lines w/ nb, nc semi- preproc. file
lines lines comments lines colons direct.
--------+--------+--------+--------+--------+--------+----
631 77 68 487 307 34 nbd.c
You should see that ENBD has between 5 and 10 times as much code as
kernel nbd. I've tried to split things up so that enbd_base.c is
roughly equivalent to kernel nbd, but the size ratio still holds. It's
not quite a fair comparison, though .. one thing that distorts the stats is that ENBD needs many
more trivial support functions just to allow things to be split up! The
extra functions become methods in a struct, and the struct is exported
to the other module, and then the caller uses the method. Pavel was
probably able to just do a straight bitop instead!
Another thing that distorts the stats is the proc interface. Although I
split it out in the code (it's about 1000 of 5000 lines total), the
support functions for its read and writes are still in the main code.
Yes, I could have not written a function and instead embedded the code
directly in the proc interface, but then maintenance would have been
impossible. So that's another reason ...
... because of the extra size of the code, ENBD has many more internal
code interfaces, in order to keep things separated and sane. It would
be unmanageable as a single monolithic lump. You get some idea of that
from the function counts in the following list:
ccount 1.0: NCSS Comnts Funcs Blanks Lines
------------------+-----+-------+------+-------+----
enbd_base.c: 1449 739 71 615 4174
enbd_ioctl.c: 70 59 12 42 409
enbd_ioctl_stub.c: 10 3 3 3 30
enbd_md.c: 34 7 6 13 99
enbd_proc.c: 452 32 16 133 1060
enbd_seqno.c: 20 13 5 8 75
enbd_speed.c: 18 4 2 14 64
Totals: 2059 857 115 837 5950
------------------+-----+-------+------+-------+----
ccount 1.0: NCSS Comnts Funcs Blanks Lines
nbd.c: 314 63 13 75 631
Note that Pavel averages 48 lines per function and I average 51,
so we probably have the same sense of "difficulty". We both comment
at about the same rate too, Pavel 1 in every 10 lines, me 1 in
every 7 lines.
But I know that I have considerable swathes of code that have to be done
inline, because they mess with request struct fields (for the remote
ioctl stuff), and have to complete and reverse the manipulations within
a single routine.
I'll close with what I said earlier ...
> ENBD is not a replacement for NBD - the two are alternatives, aimed
> at different niches. ENBD is a sort of heavyweight industrial NBD. It
> does many more things and has a different architecture. Kernel NBD is
> like a stripped down version of ENBD. Both should be in the kernel.
Peter