* Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
@ 2014-04-04 14:23 Dr. Greg Wettstein
  2014-04-07  8:10 ` Pasi Kärkkäinen
  0 siblings, 1 reply; 8+ messages in thread
From: Dr. Greg Wettstein @ 2014-04-04 14:23 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: Vladislav Bolkhovitin, scst-devel, linux-scsi

On Apr 3,  1:21pm, "Nicholas A. Bellinger" wrote:
} Subject: Re: [Scst-devel] OSS target - VMware SCSI reservation bug conform

Hi Nicholas, thanks for weighing in on this issue.

> > I had the following reasons for raising the issue:
> > 
> > 	1.) Does anyone in the open-source storage eco-system,
> > 	    ie. SCST/LIO whatever, have any confirmation that this
> > 	    is a known issue.
> > 

> FYI guys, AFAICT this bug is specific to targets that don't support
> VAAI AtomicTestandSet (COMPARE_AND_WRITE), and need to use the
> legacy SCSI-2 reservations instead.

Which, at this point in time, is probably a significant percentage of
the SAN's, both commercial and open-source based, which are feeding
storage to ESXi initiators.

> When AtomicTestandSet is available, ESX will avoid using
> reservations to lock the whole LUN and obtain exclusive access to
> individual VMFS extent on a per node basis instead.

With subsequent performance advantages as well.

ATS is obviously the path forward for multiple reasons.  Unfortunately
I have heard from multiple sources that vendors have already written
their VAAI implementations to be conformant with how ESXi works rather
than the letter of the standard.  Which potentially places us back
in the situation we are in with SCSI reservations.

Obviously Dell/EqualLogic addresses this issue or a similar problem
with their 6.0.6H2 firmware and VMware is urging people to upgrade.
That may be an EqualLogic controller error fix, in which case it isn't
relevant to those of us in this community.

If, on the other hand, there is something which can be done on the
target side to remediate the condition it is to the obvious advantage
of both LIO and SCST to know what that might be.  The question isn't
about advertised feature sets; the question is the reality of how any
of this technology works in the field.

The case we are looking at is a perfect example.  The ESXi initiators
retried I/O's several hundred times in the last year; I looked up
every one of them in the logs.  Reality isn't standards compliance
but what happens when systems are running at 380 megabytes/second
throughput and something wonky happens and triggers an edge case.

Anyone who knows storage knows that storage managers/architects are
legendary for running firmware or code releases that 'work'.  Better
the devil you know than the devil you don't.  I suspect that will be a
constraint for a long time with respect to rushing into ATS/VAAI
version 1.0 implementations.

Quite frankly the other reality is that if someone does know how to
fix this on the target side they probably are not going to talk about
it, competitive advantage and all that.....

> --nab

Have a good weekend.

Greg

}-- End of excerpt from "Nicholas A. Bellinger"

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"A computer without Windows is like a fish without a bicycle."
                                -- Chris Woods


* Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
  2014-04-04 14:23 [Scst-devel] OSS target - VMware SCSI reservation bug conformity Dr. Greg Wettstein
@ 2014-04-07  8:10 ` Pasi Kärkkäinen
  0 siblings, 0 replies; 8+ messages in thread
From: Pasi Kärkkäinen @ 2014-04-07  8:10 UTC (permalink / raw)
  To: greg; +Cc: Nicholas A. Bellinger, Vladislav Bolkhovitin, scst-devel, linux-scsi

On Fri, Apr 04, 2014 at 09:23:20AM -0500, Dr. Greg Wettstein wrote:
> 
> Obviously Dell/EqualLogic addresses this issue or a similar problem
> with their 6.0.6H2 firmware and VMware is urging people to upgrade.
> That may be an EqualLogic controller error fix, in which case it isn't
> relevant to those of us in this community.
>

Just a note that the 6.0.6H2 hotfix is already an old version; there have been multiple EQL firmware releases since then.

-- Pasi



* Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
  2014-04-03 20:21 ` Nicholas A. Bellinger
@ 2014-04-04 14:39   ` James Bottomley
  0 siblings, 0 replies; 8+ messages in thread
From: James Bottomley @ 2014-04-04 14:39 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: greg, Vladislav Bolkhovitin, scst-devel, linux-scsi

On Thu, 2014-04-03 at 13:21 -0700, Nicholas A. Bellinger wrote:
> On Wed, 2014-04-02 at 15:29 -0500, Dr. Greg Wettstein wrote:
> > On Mar 28,  8:53pm, Vladislav Bolkhovitin wrote:
> > } Subject: Re: [Scst-devel] OSS target - VMware SCSI reservation bug conform
> > 
> > > Dr. Greg Wettstein, on 03/27/2014 11:21 AM wrote:
> > > > Hi, hope the week is going well for everyone.
> > > >
> > > > There appears to be evidence that VMware has an issue with exact SCSI
> > > > standards compliance when it comes to handling corner cases with SCSI
> > > > reservation requests.  It appears as if Dell is pushing firmware hot
> > > > fixes for the EqualLogic controllers to work around the issue.
> > 
> > > Hi Greg,
> > 
> > Hi Vlad, thanks for taking the time to respond.
> > 
> > > That's interesting, but, unfortunately, your message doesn't contain
> > > sufficient technical details to look at this issue, if it exists. Or
> > > do you think we are magicians who can read minds and see through
> > > walls? ;)
> > 
> > Actually I did, but I assumed a maintenance contract would be needed
> > for that.
> > 
> > I had the following reasons for raising the issue:
> > 
> > 	1.) Does anyone in the open-source storage eco-system,
> > 	    ie. SCST/LIO whatever, have any confirmation that this
> > 	    is a known issue.
> > 
> 
> FYI guys, AFAICT this bug is specific to targets that don't support VAAI
> AtomicTestandSet (COMPARE_AND_WRITE), and need to use the legacy SCSI-2
> reservations instead.

This would be a significant problem.  SCSI-2 reservation holders have to
be aware of resets and reapply the reservation accordingly.  When I
worked at SteelEye, we used a reservation ping and reset detection
mechanism for this ... of course, that was before we switched to SCSI-3
reservations in 2005 ... why is VMware still using legacy SCSI-2?

The "bug" seems to be that VMware is using legacy reservations but not
maintaining them properly.  There's not much anyone can do to fix that.
SCSI-2 reservations have to be dropped on reset and if no-one maintains
them, they can't get magically reapplied because a reset is the way they
get broken for a dead node.  SCSI-3 reservations are reset immune, so
why isn't VMware using them (that's why SteelEye switched, for the
predictability in the face of fabric problems)?
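
As a toy illustration of the reset-detection idea described above, here
is a minimal in-memory sketch.  It is not SteelEye's mechanism and it
issues no real SCSI commands; ToyLun, Holder and ping() are hypothetical
names, and the reset counter merely stands in for whatever signal (for
example a unit attention) tells an initiator that a reset happened.

```python
class ToyLun:
    """In-memory stand-in for a LUN honoring SCSI-2 style reservations."""

    def __init__(self):
        self.holder = None      # initiator currently holding the reservation
        self.reset_count = 0    # bumped on every reset (assumed reset signal)

    def reserve(self, initiator):
        # RESERVE succeeds if the LUN is free or already held by the caller.
        if self.holder in (None, initiator):
            self.holder = initiator
            return True
        return False            # models a reservation conflict

    def reset(self):
        # A reset silently drops a SCSI-2 reservation.
        self.holder = None
        self.reset_count += 1


class Holder:
    """Initiator that keeps its reservation alive across resets."""

    def __init__(self, name, lun):
        self.name = name
        self.lun = lun
        self.seen_resets = lun.reset_count
        self.owns = lun.reserve(name)   # take the reservation up front

    def ping(self):
        # Periodic check: if a reset happened, try to re-apply the reservation.
        if self.lun.reset_count != self.seen_resets:
            self.seen_resets = self.lun.reset_count
            self.owns = self.lun.reserve(self.name)
        return self.owns


lun = ToyLun()
node_a = Holder("node-a", lun)   # node-a takes the reservation
lun.reset()                      # fabric event drops it silently
lost = lun.holder is None        # nobody holds the LUN until someone acts
regained = node_a.ping()         # reset noticed, reservation re-applied
```

Between the reset and the next ping the LUN is unreserved, so in this
model another node can take it first, in which case ping() returns False.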

James

> When AtomicTestandSet is available, ESX will avoid using reservations to
> lock the whole LUN and obtain exclusive access to individual VMFS extent
> on a per node basis instead.





* Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
  2014-04-02 20:29 Dr. Greg Wettstein
@ 2014-04-03 20:21 ` Nicholas A. Bellinger
  2014-04-04 14:39   ` James Bottomley
  0 siblings, 1 reply; 8+ messages in thread
From: Nicholas A. Bellinger @ 2014-04-03 20:21 UTC (permalink / raw)
  To: greg; +Cc: Vladislav Bolkhovitin, scst-devel, linux-scsi

On Wed, 2014-04-02 at 15:29 -0500, Dr. Greg Wettstein wrote:
> On Mar 28,  8:53pm, Vladislav Bolkhovitin wrote:
> } Subject: Re: [Scst-devel] OSS target - VMware SCSI reservation bug conform
> 
> > Dr. Greg Wettstein, on 03/27/2014 11:21 AM wrote:
> > > Hi, hope the week is going well for everyone.
> > >
> > > There appears to be evidence that VMware has an issue with exact SCSI
> > > standards compliance when it comes to handling corner cases with SCSI
> > > reservation requests.  It appears as if Dell is pushing firmware hot
> > > fixes for the EqualLogic controllers to work around the issue.
> 
> > Hi Greg,
> 
> Hi Vlad, thanks for taking the time to respond.
> 
> > That's interesting, but, unfortunately, your message doesn't contain
> > sufficient technical details to look at this issue, if it exists. Or
> > do you think we are magicians who can read minds and see through
> > walls? ;)
> 
> Actually I did, but I assumed a maintenance contract would be needed
> for that.
> 
> I had the following reasons for raising the issue:
> 
> 	1.) Does anyone in the open-source storage eco-system,
> 	    ie. SCST/LIO whatever, have any confirmation that this
> 	    is a known issue.
> 

FYI guys, AFAICT this bug is specific to targets that don't support VAAI
AtomicTestandSet (COMPARE_AND_WRITE), and need to use the legacy SCSI-2
reservations instead.

When AtomicTestandSet is available, ESX will avoid using reservations to
lock the whole LUN and obtain exclusive access to individual VMFS extents
on a per-node basis instead.
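
To make the contrast with a whole-LUN reservation concrete, here is a
minimal sketch of compare-and-write style locking on individual records.
It is a toy in-memory model under stated assumptions: ToyDisk and
lock_extent() are hypothetical names, the b"FREE" marker is invented for
the example, and nothing here reflects the real VMFS on-disk lock format
or the wire format of COMPARE_AND_WRITE.

```python
import threading


class ToyDisk:
    """In-memory block store with an atomic compare-and-write primitive."""

    def __init__(self, num_blocks):
        self._blocks = [b"FREE"] * num_blocks
        self._mutex = threading.Lock()   # stands in for target-side atomicity

    def compare_and_write(self, lba, expected, new):
        # Write `new` only if the block still holds `expected`.
        with self._mutex:
            if self._blocks[lba] != expected:
                return False             # models a miscompare
            self._blocks[lba] = new
            return True

    def read(self, lba):
        with self._mutex:
            return self._blocks[lba]


def lock_extent(disk, lba, host):
    """Try to take the lock record at `lba` for `host`, leaving others alone."""
    return disk.compare_and_write(lba, b"FREE", host)


disk = ToyDisk(num_blocks=4)
got_a = lock_extent(disk, 0, b"esx-a")   # first host wins record 0
got_b = lock_extent(disk, 0, b"esx-b")   # second host miscompares on record 0
got_c = lock_extent(disk, 1, b"esx-b")   # but can still lock record 1
```

The point of the sketch is only that contention is per record: the second
host loses on record 0 without being shut out of the rest of the device.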

--nab

> 	2.) To alert other open-source storage users/vendors that, at
> 	    least from our experience, it appears as if the problem
> 	    may be rare but legitimate.
> 
> 	3.) To determine, if the bug could be found, whether things
> 	    like mode pages would make sense in an open-source stack
> 	    to address issues such as this.
> 
> I've had a fair amount of private feedback that there is a good chance
> the issue may be legitimate.  I've also had feedback that there may be
> other issues with VMware 'corner-case' behavior.  Given the nature of
> the VAAI extensions/primitives rolling out I would anticipate that to
> be a fertile area for these types of issues as well.
> 
> I'm not even sure, given the nature of the issue, if it could be
> tracked down but everyone who is interested in the issue can now be
> looking for it.
> 
> Here is the essence of what we have to work with, redacted due to the
> volume of messages, to sentinel events.
> 
> SDS proxy: ----------------------------------------------------------------
> Mar 19 21:50:30 PROXY kernel: rport-3:0-0: blocked FC remote port time out: removing target and saving binding
> Mar 19 21:50:30 PROXY kernel: sd 3:0:0:0: rejecting I/O to offline device
> Mar 19 21:50:32 PROXY kernel: qla2xxx [0000:04:00.1]-8009:3: DEVICE RESET ISSUED nexus=3:0:0 cmd=da751240.
> ..
> .. Noise from Qlogic adapter doing DTB reset.
> ..
> Mar 19 21:50:40 PROXY kernel: qla2xxx [0000:04:00.1]-8018:3: ADAPTER RESET ISSUED nexus=3:0:0.
> ..
> .. More noise from the Qlogic adapter.
> ..
> Mar 19 21:55:43 PROXY kernel: qla2xxx [0000:04:00.1]-8017:3: ADAPTER RESET SUCCEEDED nexus=3:0:0.
> ---------------------------------------------------------------------------
> 
> VMware logs: --------------------------------------------------------------
> Mar 19 21:49:08 VMWARE1 2014-03-20T02:49:08.592Z VMWARE1 vmkernel: cpu2:8432)<6>qla2xxx 0000:42:00.0: scsi(8:0:1): Abort command succeeded -- 1 1247096017.
> Mar 19 21:49:12 VMWARE1 2014-03-20T02:49:12.664Z VMWARE1 vobd:  [vmfsCorrelator] 11527431227280us: [esx.problem.vmfs.heartbeat.timedout] 52f3b635-9ea13918-af49-bc305bee68bc VOLUMENAME
> Mar 19 21:49:14 VMWARE1 2014-03-20T02:49:12.670Z VMWARE1 Hostd: [24FDEB90 info 'Vimsvc.ha-eventmgr'] Event 466 : Lost access to volume 52f3b635-9ea13918-af49-bc305bee68bc (VOLUMENAME) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
> ..
> .. Noise from VMWARE hosts about wanting path updates - only single path
> .. to proxy, heartbeat timeouts etc.
> ..
> Mar 19 21:49:32 VMWARE1 2014-03-20T02:49:32.911Z VMWARE1 vmkernel: cpu10:8202)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T0:L1, reservation state on device eui.6665356665393330 is unknown.
> ..
> .. More noise from VMWARE hosts, additional reservation failures etc.
> ..
> Mar 19 21:50:32 VMWARE2 2014-03-20T02:50:30.554Z VMWARE2 vmkwarning: cpu22:8237)WARNING: HBX: 564: Volume 52f3b635-9ea13918-af49-bc305bee68bc ("VOLUMENAME") may be damaged on disk. Corrupt heartbeat detected at offset 3653632: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000
> ---------------------------------------------------------------------------
> 
> And at that point things were pretty much over with.
> 
> I would certainly be open to suggestions on how to track or obtain
> useful information for you.  The SDS proxy was sustaining about 350
> megabytes/second of I/O from seven initiators so I don't think turning
> on target mode debugging and cmd tracing is much of an option.
> 
> > Thanks,
> > Vlad
> 
> Have a good weekend.
> 
> Greg
> 
> }-- End of excerpt from Vladislav Bolkhovitin
> 
> As always,
> Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
> 4206 N. 19th Ave.           Specializing in information infra-structure
> Fargo, ND  58102            development.
> PH: 701-281-1686
> FAX: 701-281-3949           EMAIL: greg@enjellic.com
> ------------------------------------------------------------------------------
> "After being a technician for 2 years, I've discovered if people took
>  care of their health with the same reckless abandon as their computers,
>  half would be at the kitchen table on the phone with the hospital, trying
>  to remove their appendix with a butter knife."
>                                 -- Brian Jones
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Scst-devel mailing list
> https://lists.sourceforge.net/lists/listinfo/scst-devel




* Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
@ 2014-04-02 20:29 Dr. Greg Wettstein
  2014-04-03 20:21 ` Nicholas A. Bellinger
  0 siblings, 1 reply; 8+ messages in thread
From: Dr. Greg Wettstein @ 2014-04-02 20:29 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: scst-devel, linux-scsi

On Mar 28,  8:53pm, Vladislav Bolkhovitin wrote:
} Subject: Re: [Scst-devel] OSS target - VMware SCSI reservation bug conform

> Dr. Greg Wettstein, on 03/27/2014 11:21 AM wrote:
> > Hi, hope the week is going well for everyone.
> >
> > There appears to be evidence that VMware has an issue with exact SCSI
> > standards compliance when it comes to handling corner cases with SCSI
> > reservation requests.  It appears as if Dell is pushing firmware hot
> > fixes for the EqualLogic controllers to work around the issue.

> Hi Greg,

Hi Vlad, thanks for taking the time to respond.

> That's interesting, but, unfortunately, your message doesn't contain
> sufficient technical details to look at this issue, if it exists. Or
> do you think we are magicians who can read minds and see through
> walls? ;)

Actually I did, but I assumed a maintenance contract would be needed
for that.

I had the following reasons for raising the issue:

	1.) Does anyone in the open-source storage eco-system,
	    ie. SCST/LIO whatever, have any confirmation that this
	    is a known issue.

	2.) To alert other open-source storage users/vendors that, at
	    least from our experience, it appears as if the problem
	    may be rare but legitimate.

	3.) To determine, if the bug could be found, whether things
	    like mode pages would make sense in an open-source stack
	    to address issues such as this.

I've had a fair amount of private feedback that there is a good chance
the issue may be legitimate.  I've also had feedback that there may be
other issues with VMware 'corner-case' behavior.  Given the nature of
the VAAI extensions/primitives rolling out I would anticipate that to
be a fertile area for these types of issues as well.

I'm not even sure, given the nature of the issue, if it could be
tracked down but everyone who is interested in the issue can now be
looking for it.

Here is the essence of what we have to work with, reduced to sentinel
events because of the volume of messages.

SDS proxy: ----------------------------------------------------------------
Mar 19 21:50:30 PROXY kernel: rport-3:0-0: blocked FC remote port time out: removing target and saving binding
Mar 19 21:50:30 PROXY kernel: sd 3:0:0:0: rejecting I/O to offline device
Mar 19 21:50:32 PROXY kernel: qla2xxx [0000:04:00.1]-8009:3: DEVICE RESET ISSUED nexus=3:0:0 cmd=da751240.
..
.. Noise from Qlogic adapter doing DTB reset.
..
Mar 19 21:50:40 PROXY kernel: qla2xxx [0000:04:00.1]-8018:3: ADAPTER RESET ISSUED nexus=3:0:0.
..
.. More noise from the Qlogic adapter.
..
Mar 19 21:55:43 PROXY kernel: qla2xxx [0000:04:00.1]-8017:3: ADAPTER RESET SUCCEEDED nexus=3:0:0.
---------------------------------------------------------------------------

VMware logs: --------------------------------------------------------------
Mar 19 21:49:08 VMWARE1 2014-03-20T02:49:08.592Z VMWARE1 vmkernel: cpu2:8432)<6>qla2xxx 0000:42:00.0: scsi(8:0:1): Abort command succeeded -- 1 1247096017.
Mar 19 21:49:12 VMWARE1 2014-03-20T02:49:12.664Z VMWARE1 vobd:  [vmfsCorrelator] 11527431227280us: [esx.problem.vmfs.heartbeat.timedout] 52f3b635-9ea13918-af49-bc305bee68bc VOLUMENAME
Mar 19 21:49:14 VMWARE1 2014-03-20T02:49:12.670Z VMWARE1 Hostd: [24FDEB90 info 'Vimsvc.ha-eventmgr'] Event 466 : Lost access to volume 52f3b635-9ea13918-af49-bc305bee68bc (VOLUMENAME) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
..
.. Noise from VMWARE hosts about wanting path updates - only single path
.. to proxy, heartbeat timeouts etc.
..
Mar 19 21:49:32 VMWARE1 2014-03-20T02:49:32.911Z VMWARE1 vmkernel: cpu10:8202)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T0:L1, reservation state on device eui.6665356665393330 is unknown.
..
.. More noise from VMWARE hosts, additional reservation failures etc.
..
Mar 19 21:50:32 VMWARE2 2014-03-20T02:50:30.554Z VMWARE2 vmkwarning: cpu22:8237)WARNING: HBX: 564: Volume 52f3b635-9ea13918-af49-bc305bee68bc ("VOLUMENAME") may be damaged on disk. Corrupt heartbeat detected at offset 3653632: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000
---------------------------------------------------------------------------

And at that point things were pretty much over with.
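
For anyone lining these events up, a small sketch that computes the
spacing between the ESXi-side sentinel events from the UTC timestamps
embedded in the VMware excerpt.  The event labels are my own shorthand,
and the last event was logged by a different host (VMWARE2), so the
figures assume the two hosts' clocks were reasonably well synchronized.

```python
from datetime import datetime

# Embedded UTC timestamps copied from the VMware log excerpt; labels are shorthand.
EVENTS = [
    ("abort succeeded",   "2014-03-20T02:49:08.592Z"),
    ("heartbeat timeout", "2014-03-20T02:49:12.664Z"),
    ("RESERVE failed",    "2014-03-20T02:49:32.911Z"),
    ("corrupt heartbeat", "2014-03-20T02:50:30.554Z"),
]

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"   # trailing "Z" is matched as a literal


def offsets(events):
    """Return (label, seconds since the first event) for each event."""
    t0 = datetime.strptime(events[0][1], FMT)
    return [(label, (datetime.strptime(ts, FMT) - t0).total_seconds())
            for label, ts in events]


for label, secs in offsets(EVENTS):
    print(f"{secs:8.3f} s  {label}")
```

By these timestamps the RESERVE failure comes about 24 seconds after the
abort and the corrupt heartbeat about 82 seconds after it.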

I would certainly be open to suggestions on how to track or obtain
useful information for you.  The SDS proxy was sustaining about 350
megabytes/second of I/O from seven initiators so I don't think turning
on target mode debugging and cmd tracing is much of an option.
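
One low-overhead option, sketched here purely as an assumption about what
might be useful, is to filter the existing syslog stream for sentinel
strings after the fact instead of enabling tracing on the target.  The
pattern list is hypothetical and taken only from the strings visible in
the excerpts above; sentinel_lines() is an illustrative helper, not part
of SCST or any other tool.

```python
import re

# Hypothetical sentinel patterns, drawn from the log excerpts in this message.
SENTINELS = [
    re.compile(r"blocked FC remote port time out"),
    re.compile(r"(DEVICE|ADAPTER) RESET (ISSUED|SUCCEEDED)"),
    re.compile(r"Abort command succeeded"),
    re.compile(r"esx\.problem\.vmfs\.heartbeat\.timedout"),
    re.compile(r"SCSI cmd RESERVE failed"),
    re.compile(r"Corrupt heartbeat detected"),
]


def sentinel_lines(lines):
    """Yield only the lines that match at least one sentinel pattern."""
    for line in lines:
        if any(p.search(line) for p in SENTINELS):
            yield line.rstrip("\n")


# Two lines copied from the SDS proxy excerpt: the first matches no pattern.
sample = [
    "Mar 19 21:50:30 PROXY kernel: sd 3:0:0:0: rejecting I/O to offline device",
    "Mar 19 21:50:40 PROXY kernel: qla2xxx [0000:04:00.1]-8018:3: "
    "ADAPTER RESET ISSUED nexus=3:0:0.",
]
kept = list(sentinel_lines(sample))
```

Because the function takes any iterable of lines, it can be fed an open
log file directly, so the full log never has to be held in memory.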

> Thanks,
> Vlad

Have a good weekend.

Greg

}-- End of excerpt from Vladislav Bolkhovitin

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"After being a technician for 2 years, I've discovered if people took
 care of their health with the same reckless abandon as their computers,
 half would be at the kitchen table on the phone with the hospital, trying
 to remove their appendix with a butter knife."
                                -- Brian Jones


* Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
  2014-03-28  5:01 Dr. Greg Wettstein
@ 2014-03-28 18:34 ` Dale R. Worley
  0 siblings, 0 replies; 8+ messages in thread
From: Dale R. Worley @ 2014-03-28 18:34 UTC (permalink / raw)
  To: linux-scsi

> From: "Dr. Greg Wettstein" <greg@wind.enjellic.com>

> If there is an issue it would seem to be in the best interests of
> those of us committed to open-source storage solutions to understand
> and protect ourselves from the situation.  There is a third saying
> which is important as well:

If the question is a legitimate question of interpretation, then the
better course is probably "be liberal in what you receive and be
conservative in what you transmit".

Practically speaking, I remember the story of US Robotics modems.  I'm
told they captured the market for server modems in the days of dialup
services by testing their modems against every make of modem on the
market and tweaking its behavior so that it could successfully
interwork with basically any modem the consumer purchased, no matter
how crappy.  This level of reliability was very valuable to the
operators of online services.

Dale


* Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
@ 2014-03-28  5:01 Dr. Greg Wettstein
  2014-03-28 18:34 ` Dale R. Worley
  0 siblings, 1 reply; 8+ messages in thread
From: Dr. Greg Wettstein @ 2014-03-28  5:01 UTC (permalink / raw)
  To: Tommy Apel; +Cc: scst-devel, linux-scsi

On Mar 27,  9:22pm, Tommy Apel wrote:
} Subject: Re: [Scst-devel] OSS target - VMware SCSI reservation bug conform

Good morning, hope the end of the week is going well for everyone.

> 2014-03-27 19:21 GMT+01:00 Dr. Greg Wettstein <greg@wind.enjellic.com>:
> > Hi, hope the week is going well for everyone.
> >
> > There appears to be evidence that VMware has an issue with exact SCSI
> > standards compliance when it comes to handling corner cases with SCSI
> > reservation requests.  It appears as if Dell is pushing firmware hot
> > fixes for the EqualLogic controllers to work around the issue.

> Hello lists, excuse me for being a little blunt here, but if what
> Dr.  Greg is saying here is correct, shouldn't it be VMware that
> fixes the bug in their software, rather than everybody else asking
> "how high" when VMware says jump?
>
> I mean, implementing non-standard things in a standards-compliant
> stack is sort of the wrong path to take, I should think.  If
> VMware has a bug in their software, closed source I might add,
> they should fix it, not the OSS community.
>
> Maybe I'm wrong here, but I believe that standards were made so that
> everybody could implement a look-the-same / feel-the-same interface
> instead of inventing the wheel over and over again for every brand
> known to mankind.

Very understandable sentiments and ones which I certainly sympathize
with.

I've been doing this stuff for a long time and unfortunately I believe
this is a situation where the old meme applies:

	"The wonderful thing about standards is that there are so many
	 to choose from."

I would add to that:

	"Everyone should have their own version of their favorite
	 standard."

Every standard ever written has issues with respect to behavior on
edge cases.  I believe VMware's position is that it has implemented
the standard properly and if there is a bug it isn't their issue.

If there is an issue it is obviously an edge case given that it
appears to be quite rare.  I've heard suggestions that the possible
vulnerability window amounts to about 12 seconds a day.

Whatever the reality of the regression, given the litigious nature of
our society I think it is highly unlikely for anyone to admit they
have a bug in terms of standards implementation, particularly in the
storage industry.  To do so would open one to inevitable data loss
litigation.

I'm not a fan of 'tuning' standards implementations but mode pages seem
to be a reality in the industry and there are ample practical reasons
for them.  The other reality is that VMware/EMC is a way bigger
gorilla than the open-source storage stacks whether they be LIO, SCST
or anything else.

If there is an issue it would seem to be in the best interests of
those of us committed to open-source storage solutions to understand
and protect ourselves from the situation.  There is a third saying
which is important as well:

	"No one ever got fired for buying vendor approved storage."

> /Tommy

Have a good weekend.

Greg

}-- End of excerpt from Tommy Apel

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"God made man, the appendix was the result of a committee."
                                -- Dr. G.W. Wettstein
                                   Guerrilla Tactics for Corporate Survival


* Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
  2014-03-27 18:21 Dr. Greg Wettstein
@ 2014-03-27 20:22 ` Tommy Apel
  0 siblings, 0 replies; 8+ messages in thread
From: Tommy Apel @ 2014-03-27 20:22 UTC (permalink / raw)
  To: greg; +Cc: scst-devel, linux-scsi

2014-03-27 19:21 GMT+01:00 Dr. Greg Wettstein <greg@wind.enjellic.com>:
> Hi, hope the week is going well for everyone.
>
> There appears to be evidence that VMware has an issue with exact SCSI
> standards compliance when it comes to handling corner cases with SCSI
> reservation requests.  It appears as if Dell is pushing firmware hot
> fixes for the EqualLogic controllers to work around the issue.
>
> We may have actually caught this one in the wild with SCST.  I'm
> including the linux-scsi list since it may affect any target code
> which is written strictly to the SCSI standards.  Dell appears to be
> handling it with a custom mode page and if the rumor is true it would
> seem the OSS targets may need to consider something similar given the
> importance of VMware as a client.
>
> VMware was being fed storage from a RAID1 mirror on a software defined
> storage (SDS) appliance based on SCST.  The two RAID1 block devices
> were being supplied from two geographically isolated data-centers.  So
> technically VMware should not see an I/O error as long as the RAID1
> layer is running properly, and none of the VMware initiators did.
>
> All target systems were the top of the SCST 2.2.x tree.  The in-kernel
> Qlogic target driver was being used along with our SCST/Qlogic
> interface driver.  The SDS node was connected with 4 GBPS FC into a
> Nexus 5500 which fed a Nexus 7010 which linked to the remote
> data-center through a 20 GBPS FCOE ISL link to a Nexus 7009 which
> downstreamed into another Nexus 5500 and then into the backing target
> with 8 GBPS fibre-channel.
>
> One data-center took a hit which instantly knocked out one of the
> RAID1 devices.  The Qlogic card talking to that data-center went into
> a DTB nexus reset followed by a full adapter reset.
>
> That caused the VMware initiators to begin to time out and abort
> I/O's.  The relative timeline was as follows:
>
>         00:00:00 ->     Qlogic adapter reset.
>
>         00:00:02 ->     VMware Qlogic I/O abort succeeded.
>
>         00:00:24 ->     VMware SCSI cmd RESERVE failed.
>
>         00:01:00 ->     VMware corrupt heartbeat detected.
>
> So it was all over with, except for the restores from tape, in about 1
> minute... :-(
>
> The storage system is obviously designed for high availability and has
> seen hundreds of aborted I/O's by the VMware initiators due to wide
> area fabric issues and the like.  The SDS proxy had been running for
> almost two years with no issues so we obviously hit some edge case in
> this instance.
>
> There were nine other big LUN's being fed from the SDS node to
> non-VMware initiators and no issues were noted on any of those so the
> regression appears to be tied specifically to VMware.  Some of the
> 'rumors' floating around is that the SCSI reservation regression is
> linked to aborted I/O's during a tight race window so that would add
> additional credence to the notion we provoked this issue.
>
> I'm assuming if there is the chance to fix this at the target level
> there has to be interest within the community.  There isn't a lot
> which can be done to protect an installation, other than hot snapshots
> at the SDS proxy level, since one has to pretty much trust initiators
> to 'do the right thing' which is of course always an issue in
> SCSI-land, particularly with clustered filesystem locking.
>
> We would be interested in any thoughts/reflections that people might
> have.
>
> Have a good remainder of the week.
>
> As always,
> Dr. G.W. Wettstein, Ph.D.   IDfusion.org
> 4206 N. 19th Ave.           Unified health identity architecture.
> Fargo, ND  58102
> PH: 701-281-1686
> FAX: 701-281-3949           EMAIL: greg@idfusion.org
> ------------------------------------------------------------------------------
> "Man, despite his artistic pretensions, his sophistication and many
>  accomplishments, owes the fact of his existence to a six-inch layer of
>  topsoil and the fact that it rains."
>                                 -- Anonymous writer on perspective.
>                                    GAUSSIAN quote.
>

Hello lists, excuse me for being a little blunt here, but if what Dr.
Greg is saying here is correct, shouldn't it be VMware that fixes
the bug in their software, rather than everybody else asking "how
high" when VMware says jump?
I mean, implementing non-standard things in a standards-compliant stack
is sort of the wrong path to take, I should think.  If VMware
has a bug in their software, closed source I might add, they should
fix it, not the OSS community.

Maybe I'm wrong here, but I believe that standards were made so that
everybody could implement a look-the-same / feel-the-same interface
instead of inventing the wheel over and over again for every brand
known to mankind.

-- 

/Tommy

