RE: Summary of the Multi-Path BOF at OLS and future directions

All of lore.kernel.org
 help / color / mirror / Atom feed

* RE: Summary of the Multi-Path BOF at OLS and future directions
@ 2003-08-08 12:13 jansen, frank
  2003-08-08 12:15 ` Christoph Hellwig
  2003-08-08 12:21 ` Josef Möllers
  0 siblings, 2 replies; 15+ messages in thread
From: jansen, frank @ 2003-08-08 12:13 UTC (permalink / raw)
  To: 'Josef Möllers', Christoph Hellwig
  Cc: James Bottomley, SCSI Mailing List

> 
> Christoph Hellwig wrote:
> 
> > > 4. Configuration of this solution would be extremely 
> important.  The
> > > idea here is to rely on the udev solution currently 
> making its way into
> > > the kernel and essentially have a vendor specific multi-path
> > > configuration as a udev plug-in.
> > >
> > > 5. Vendor value add for specific devices could be encoded both as
> > > configuration (udev) pieces and plug-ins to the upper 
> layer multi-path
> > > driver to activate any proprietary vendor specific 
> configuration options
> > > that may be needed for specific solutions.
> > 
> > What are examples of such value add?
> 
> Some older EMC RAID boxes need a special "TRESPASS" command 
> (implemented
> as a MODE SELECT) to switch the ownership of a LUN from one path
> ("Storage Processor") to the other. The newer ones do this
> automatically, though.

The delineation between the EMC boxes that require vs. not require
a "TRESPASS" command lies not with their age, but rather their family.
The CLARiiON family of Storage Arrays require a "TRESPASS" command to
move a LUN from one redundant Storage Processor (SP) to another.  On the
other hand, Symmetrix Storage Arrays do not use any such behavior.  In
a nutshell, the Symmetrix is active on all paths, on which the LUN is
presented, whereas the CLARiiON is active on one SP and passive on the
other SP.

> Without the TRESPASS, you'd see the LUN (INQUIRY), but you can't do
> anything with it, even a simple READ CAPACITY fails.

This is correct.  Note, that it is not necessarily desirable to TRESPASS
the LUN, as there may be an active path from the other SP to this or
another HBA on the host system.

> -- 
> Josef Möllers (Pinguinpfleger bei FSC)
> 	If failure had no penalty success would not be a prize
> 						-- T.  Pratchett
> -
> To unsubscribe from this list: send the line "unsubscribe 
> linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-08 12:13 Summary of the Multi-Path BOF at OLS and future directions jansen, frank
@ 2003-08-08 12:15 ` Christoph Hellwig
  2003-08-08 12:21 ` Josef Möllers
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2003-08-08 12:15 UTC (permalink / raw)
  To: jansen, frank
  Cc: 'Josef Möllers',
	Christoph Hellwig, James Bottomley, SCSI Mailing List

On Fri, Aug 08, 2003 at 08:13:11AM -0400, jansen, frank wrote:
> The delineation between the EMC boxes that require vs. not require
> a "TRESPASS" command lies not with their age, but rather their family.
> The CLARiiON family of Storage Arrays require a "TRESPASS" command to
> move a LUN from one redundant Storage Processor (SP) to another.  On the
> other hand, Symmetrix Storage Arrays do not use any such behavior.  In
> a nutshell, the Symmetrix is active on all paths, on which the LUN is
> presented, whereas the CLARiiON is active on one SP and passive on the
> other SP.

So it's rather DG arrays vs EMC ones, ok.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-08 12:13 Summary of the Multi-Path BOF at OLS and future directions jansen, frank
  2003-08-08 12:15 ` Christoph Hellwig
@ 2003-08-08 12:21 ` Josef Möllers
  1 sibling, 0 replies; 15+ messages in thread
From: Josef Möllers @ 2003-08-08 12:21 UTC (permalink / raw)
  To: jansen, frank; +Cc: Christoph Hellwig, James Bottomley, SCSI Mailing List

"jansen, frank" wrote:

> > Without the TRESPASS, you'd see the LUN (INQUIRY), but you can't do
> > anything with it, even a simple READ CAPACITY fails.
> 
> This is correct.  Note, that it is not necessarily desirable to TRESPASS
> the LUN, as there may be an active path from the other SP to this or
> another HBA on the host system.

With MultiPath it _is_ necessary to TRESPASS a CLARiiON box if each SP
is conected to a seperate path.

-- 
Josef Möllers (Pinguinpfleger bei FSC)
	If failure had no penalty success would not be a prize
						-- T.  Pratchett
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-08 12:28 jansen, frank
@ 2003-08-08 13:27 ` Josef Möllers
  0 siblings, 0 replies; 15+ messages in thread
From: Josef Möllers @ 2003-08-08 13:27 UTC (permalink / raw)
  To: jansen, frank; +Cc: Christoph Hellwig, James Bottomley, SCSI Mailing List

"jansen, frank" wrote:
> 
> > Josef Möllers wrote
> >
> > With MultiPath it _is_ necessary to TRESPASS a CLARiiON box if each SP
> > is conected to a seperate path.
> >
> To access the non-active path, this is correct.  What I was trying to say
> is that the MultiPath layer should be somewhat intelligent about this and
> be aware that there may be another active path.  It is much quicker to go
> down an active path than trespass a LUN and then do the I/O.  The other
> part is the risk of excessive trespassing, where a LUN just gets bounced
> back and forth between SPs for each I/O; this is an absolute worst case
> that would bring performance to a standstill.

Obviously.
What we did was to do a TRESPASS only if a command failed with certain
sense data:
ILLEGAL REQUEST, LUN NOT READY, CAUSE NOT REPORTABLE.

Have a nice weekend
-- 
Josef Möllers (Pinguinpfleger bei FSC)
	If failure had no penalty success would not be a prize
						-- T.  Pratchett
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Summary of the Multi-Path BOF at OLS and future directions
@ 2003-08-08 12:28 jansen, frank
  2003-08-08 13:27 ` Josef Möllers
  0 siblings, 1 reply; 15+ messages in thread
From: jansen, frank @ 2003-08-08 12:28 UTC (permalink / raw)
  To: 'Josef Möllers', jansen, frank
  Cc: Christoph Hellwig, James Bottomley, SCSI Mailing List

> Josef Möllers wrote
> 
> With MultiPath it _is_ necessary to TRESPASS a CLARiiON box if each SP
> is conected to a seperate path.
> 
To access the non-active path, this is correct.  What I was trying to say
is that the MultiPath layer should be somewhat intelligent about this and
be aware that there may be another active path.  It is much quicker to go
down an active path than trespass a LUN and then do the I/O.  The other
part is the risk of excessive trespassing, where a LUN just gets bounced
back and forth between SPs for each I/O; this is an absolute worst case 
that would bring performance to a standstill.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-07 16:20 ` Christoph Hellwig
  2003-08-07 23:54   ` Tim Pepper
@ 2003-08-08  6:45   ` Josef Möllers
  1 sibling, 0 replies; 15+ messages in thread
From: Josef Möllers @ 2003-08-08  6:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: James Bottomley, SCSI Mailing List

Christoph Hellwig wrote:

> > 4. Configuration of this solution would be extremely important.  The
> > idea here is to rely on the udev solution currently making its way into
> > the kernel and essentially have a vendor specific multi-path
> > configuration as a udev plug-in.
> >
> > 5. Vendor value add for specific devices could be encoded both as
> > configuration (udev) pieces and plug-ins to the upper layer multi-path
> > driver to activate any proprietary vendor specific configuration options
> > that may be needed for specific solutions.
> 
> What are examples of such value add?

Some older EMC RAID boxes need a special "TRESPASS" command (implemented
as a MODE SELECT) to switch the ownership of a LUN from one path
("Storage Processor") to the other. The newer ones do this
automatically, though.
Without the TRESPASS, you'd see the LUN (INQUIRY), but you can't do
anything with it, even a simple READ CAPACITY fails.
-- 
Josef Möllers (Pinguinpfleger bei FSC)
	If failure had no penalty success would not be a prize
						-- T.  Pratchett
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-07 16:20 ` Christoph Hellwig
@ 2003-08-07 23:54   ` Tim Pepper
  2003-08-08  6:45   ` Josef Möllers
  1 sibling, 0 replies; 15+ messages in thread
From: Tim Pepper @ 2003-08-07 23:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: James Bottomley, SCSI Mailing List

On Thu 07 Aug at 17:20:28 +0100 hch@infradead.org done said:
> > 
> > 5. Vendor value add for specific devices could be encoded both as
> > configuration (udev) pieces and plug-ins to the upper layer multi-path
> > driver to activate any proprietary vendor specific configuration options
> > that may be needed for specific solutions.
> 
> What are examples of such value add?

Personally I'm skeptical that there is much need for plugins, but my employer
really wants to find ways to sell software licenses for advanced
features with their hardware.  They've never defined any advanced features
in terms that weren't marketing speak.  :|

One thing I'm curious about is where non-active-active targets would
be manually failed over to their alternate controller (if they aren't
able to automatically)?  Depending on how config happens, a pluggable
interface for path selection (or error recovery path selection) _and_
configuration may be needed to handle controller failover in the right
way for these devices.  Hopefully this sort of device is going away
though as higher end storage features are commoditised.

At config time there's obviously a need for vendor specific bits since
everybody's doing their inquiry data differently and there's no standard
way to know a device is active-active or if not which is the preferred
controller for a given lun and whether the device is doing automatic
or manual volume transfer on a controller failure.  I for one wouldn't
want to try maintaining a central list of this stuff if the vendors are
willing to do it individually.

t.

-- 
*********************************************************
*  tpepper@vato dot org             * Venimus, Vidimus, *
*  http://www.vato.org/~tpepper     * Dolavimus         *
*********************************************************

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-05  3:54 James Bottomley
  2003-08-05 16:48 ` Alan Cox
  2003-08-06  0:14 ` Patrick Mansfield
@ 2003-08-07 16:20 ` Christoph Hellwig
  2003-08-07 23:54   ` Tim Pepper
  2003-08-08  6:45   ` Josef Möllers
  2 siblings, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2003-08-07 16:20 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List

On Mon, Aug 04, 2003 at 08:54:55PM -0700, James Bottomley wrote:
> 1. Multi-path is relevant to more layers of the I/O stack than just
> SCSI. Thus, it makes sense to do it at the layer just above bio.  This
> would either be md/multipath or the Device Mapper multi-path module.

Are the compatible?  What's the downside of one vs another?

> 4. Configuration of this solution would be extremely important.  The
> idea here is to rely on the udev solution currently making its way into
> the kernel and essentially have a vendor specific multi-path
> configuration as a udev plug-in.
> 
> 5. Vendor value add for specific devices could be encoded both as
> configuration (udev) pieces and plug-ins to the upper layer multi-path
> driver to activate any proprietary vendor specific configuration options
> that may be needed for specific solutions.

What are examples of such value add?
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-05 17:06   ` James Bottomley
@ 2003-08-07 11:00     ` Alan Cox
  0 siblings, 0 replies; 15+ messages in thread
From: Alan Cox @ 2003-08-07 11:00 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List

On Maw, 2003-08-05 at 18:06, James Bottomley wrote:
> On Tue, 2003-08-05 at 09:48, Alan Cox wrote:
> > On Maw, 2003-08-05 at 04:54, James Bottomley wrote:
> > > transport errors (relevant to multi-path) and medium errors (relevant to
> > > software raid).
> > 
> > And multimedia..
> 
> Yes, multi-media probably wants all error indications.

More importantly it wants to control retries


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-06 20:26   ` Steven Dake
@ 2003-08-07  7:38     ` Lars Marowsky-Bree
  0 siblings, 0 replies; 15+ messages in thread
From: Lars Marowsky-Bree @ 2003-08-07  7:38 UTC (permalink / raw)
  To: Steven Dake, Patrick Mansfield
  Cc: James Bottomley, SCSI Mailing List, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1678 bytes --]

On 2003-08-06T13:26:41,
   Steven Dake <sdake@mvista.com> said:

> There are problems with multipath in the md driver, specifically how to
> manage partitions.  Each partition requires a seperate multipath. 

This is indeed a deficiency, but it all depends on how the DM device is
setup. You can also setup the DM for the physical device and then the
partitions on top of the DM, either using the partition code in the
kernel or, better, using user-space discovery and creating appropriate
DM devices to match the partitions.

> Changing partition sizes is quite difficult after multipaths are setup. 
> One could argue that partitions should be managed by device mapper, but
> unfortunately all firmware doesn't know about device mapper devices,
> requiring the use of partitions.

Whether DM manages partitions or not is unrelated to whether firmware
knows about them; firmware depends on how the metadata is represented on
disk, not how (and where, whether 'in kernel' or DM) Linux choses to
interpret it.

> Also most people don't want to mess with device mapper.

Tough ;-)

> I like the idea of a generic queueing interface for block commands
> integrated with multipathing as it solves the partitioning problem quite
> nicely.

This would of course also be nice.

Sorry for getting into the discussion so late on the list here. I really
managed to miss it. I wonder how ever I did that...

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
SuSE Labs - Research & Development, SuSE Linux AG

"If anything can go wrong, it will." "Chance favors the prepared (mind)."
  -- Capt. Edward A. Murphy            -- Louis Pasteur

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-06  0:14 ` Patrick Mansfield
@ 2003-08-06 20:26   ` Steven Dake
  2003-08-07  7:38     ` Lars Marowsky-Bree
  0 siblings, 1 reply; 15+ messages in thread
From: Steven Dake @ 2003-08-06 20:26 UTC (permalink / raw)
  To: Patrick Mansfield; +Cc: James Bottomley, SCSI Mailing List, linux-raid

[-- Attachment #1: Type: text/plain, Size: 5695 bytes --]

On Tue, 2003-08-05 at 17:14, Patrick Mansfield wrote:
> James -
> 
> Thanks for the summary.
> 
> On Mon, Aug 04, 2003 at 08:54:55PM -0700, James Bottomley wrote:
> 
> > 1. Multi-path is relevant to more layers of the I/O stack than just
> > SCSI. Thus, it makes sense to do it at the layer just above bio.  This
> > would either be md/multipath or the Device Mapper multi-path module.
> 
> I was hoping for linux scsi to evolve into a "native queueing driver" [1],
> adding multi-path to such a driver would be appropriate (of course IMO),
> users of the native queueing driver would then get multi-path support.
> (This is what I meant when referencing the "packet command interface" at
> the SCSI BOF, sorry if the name made no sense, I thought there had been
> earlier references to a common "packet interface" driver or such.)
> 
> Given the consensus for md/dm, I'm not planning any further work on a scsi
> mid-level solution, though technically I prefer the mid-level approach.
> 
Patrick,

There are problems with multipath in the md driver, specifically how to
manage partitions.  Each partition requires a seperate multipath. 
Changing partition sizes is quite difficult after multipaths are setup. 
One could argue that partitions should be managed by device mapper, but
unfortunately all firmware doesn't know about device mapper devices,
requiring the use of partitions.  Also most people don't want to mess
with device mapper.

I like the idea of a generic queueing interface for block commands
integrated with multipathing as it solves the partitioning problem quite
nicely.

As an example of the problems associated with multipath in the md
driver, I've attached a small program that automatically configures SCSI
multipaths using the md layer driver (requires devfs, but should be easy
for you:) to modify).  But after this, it becomes incredibly difficult
to manage multipaths during partition changes.

There are solutions to manage this (online change of multipath size
during fdisk), but it would be better if multipath didn't have to
multipath the partitions as well.

After running the program, look at /proc/mdstat to see all of the
multipaths automatically configured.  While its pretty neat, it makes
managing partitions impossible (or does someone have a way??).

Thanks
-steve


> One other issue discussed at the multi-path BOF is the lack of character
> device (tape) support - dm does not work for such devices. (We do not need
> a multi-ported tape device to see multi-path in linux, multiple
> initiators on the same transport/bus/etc. also show up as multi-path).
> 
> Some other points following.
> 
> > 3.  It was noted that symmetric active multi-path in this scheme is not
> > possible without the ability to place a proper elevator above the
> > multi-pathing driver (and have a simple queue only noop elevator
> > below).  This should help alleviate the current fragmentation issues
> > where symmetric active multi-path produces I/O in decidedly non-optimal
> > page sized chunks.
> 
> Related to queueing - we also need to queue commands (in dm) to avoid
> sending too many commands to the actual device: dm should not send more
> than scsi_device->queue_depth commands.
> 
> queue_depth changes via user (sysfs) or kernel space should eventually be
> addressed (right now only one LLDD is using the scsi_track_queue_full).
> 
> We should eventually export scsi_host attributes (i.e. host_busy reached
> can_queue limit, and host_blocked) such that dm can avoid congested or
> blocked hosts.
> 
> We need to ensure that scsi_device fields (generally the per device state-like)
> function properly when used with multi-path dm, including:
> 
> 	access_count - probably OK with latest ref count changes, so a
> 	call to the release function by dm should remove a scsi_device (if
> 	scsi_remove_device was called on an active scsi_device), I don't
> 	know dm/md enough as to when/how it might release a path/device
> 
> 	online - more below
> 
> 	was_reset - probably OK, since it is somewhat path specific
> 
> 	expecting_cc_ua - probably OK, same as was_reset
> 
> 	device_blocked - QUEUE FULL was seen, we don't want commands
> 	on a given path to be starved out
> 
> 	sdev_state - Mike's changes, I haven't looked at if/how it's
> 	affected relative to dm multi-path
> 
> For the online flag: on timeout, if we fast fail and do not try to recover
> the device or transport, the device could be left online, and leave it to
> dm to not send any further IO requests. This also might protect us from device
> resets (other paths might have active IO). But this means a timeout might
> take a dm path offline, and retrying on a separate path could offline all
> paths to the device.
> 
> > infrastructure for us (in 2.6.0-test2).  The attached patch should add
> > the fast fail capability to SCSI (although without the upwards/downwards
> > failure indications) and we should be able to build the rest of the
> > infrastructure on this framework.
> 
> What about a MEDIUM_ERROR - will all sectors be seen as completed with no
> error for partial completion of IO (uptodate is 1 in scsi_end_request,
> but your patch sets sectors = req->hard_nr_sectors)?
> 
> Per above the error handler (cmd timeout) should not requeue/retry if fast
> fail is set (in scsi_eh_flush_done_q). And, should the error handler
> recovery/resetting run for fast fail?
> 
> [1] http://marc.theaimsgroup.com/?l=linux-kernel&m=105400909207359&w=2
> 
> -- Patrick Mansfield
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

[-- Attachment #2: automp.c --]
[-- Type: text/x-c, Size: 10251 bytes --]

/*
 * Copyright (C) 2003 MontaVista Software, Inc.
 *
 *	Author: Steven Dake (sdake@mvista.com)
 *
 * GPL v2 License
 */
#include <sys/sysmacros.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/major.h>
#include <string.h>
#include <linux/list.h>
#include <scsi/scsi.h>
#include <scsi/scsi_ioctl.h>
#include <dirent.h>
#include <stdio.h>

#include <linux/fs.h>
#include <linux/raid/md_u.h>


struct scsi_device_strings {
	char vendor[9];
	char model[17];
	char rev[5];
	char serial[9];
};

union ieee_id_map {
	unsigned long long ieee_id;
	unsigned char ieee_id_u8[8];
};

struct scsi_device {
	struct scsi_device_strings scsi_device_strings;
	unsigned long long ieee_id;
	unsigned char lun;
	unsigned char host;
	unsigned char bus;
	unsigned char id;
	unsigned int partmap;
};

struct scsi_get_idlun {
	unsigned char id;
	unsigned char lun;
	unsigned char bus;
	unsigned char host;
	unsigned int host_unique_id;
};

struct device_number {
	int number;
	struct device_number *next;
};

struct multipath {
	struct device_number *device_number_head;
	struct scsi_device *scsi_device;
	int paths;
};

struct inquiry_command {
	unsigned int input_size;
	unsigned int output_size;
	char cmd[6];
};

struct inquiry_result {
	unsigned int input_size;
	unsigned int output_size;
	unsigned char device_type;
	unsigned char device_modifier;
	unsigned char version;
	unsigned char data_format;
	unsigned char length;
	unsigned char reserved1;
	unsigned char reserved2;
	unsigned char state;
	unsigned char vendor[8];
	unsigned char model[16];
	unsigned char rev[4];
	unsigned char serial[12];
	unsigned char reserved[39];
};

struct extended_inquiry_result {
	unsigned int input_size;
	unsigned int output_size;
	unsigned char junk[8];
	unsigned char ieee_id[8];
};


int get_scsi_device (int minor, struct scsi_device *device)
{
unsigned char ioctl_data[256];
struct inquiry_command *inquiry_command;
struct inquiry_result *inquiry_result;
struct extended_inquiry_result *extended_inquiry_result;
int fd;
int i, j;
int result;
unsigned char *p;
struct scsi_get_idlun scsi_get_idlun;
union ieee_id_map ieee_id_map;

	result = mknod ("this", 0600 | S_IFCHR, makedev (SCSI_GENERIC_MAJOR, minor));
	if (result) {
		return (-1);
	}

	fd = open ("this", O_RDWR);
	if (fd == -1) {
		unlink ("this");
		return (-1);
	}

	memset (ioctl_data, 0, sizeof (ioctl_data));
	
	/*
	 * Execute inquiry command to get SCSI serial number
	 */
	inquiry_command = ioctl_data;
	inquiry_command->input_size = 0;
	inquiry_command->output_size = sizeof (struct inquiry_result);
	inquiry_command->cmd[0] = 0x12;
	inquiry_command->cmd[1] = 0x00;
	inquiry_command->cmd[2] = 0x00;
	inquiry_command->cmd[3] = 0x00;
	inquiry_command->cmd[4] = 96;
	inquiry_command->cmd[5] = 0x00;

	result = ioctl (fd, 1, inquiry_command);

	inquiry_result = ioctl_data;
	strncpy (device->scsi_device_strings.vendor, inquiry_result->vendor, 8);
	strncpy (device->scsi_device_strings.model, inquiry_result->model, 16);
	strncpy (device->scsi_device_strings.rev, inquiry_result->rev, 4);
	strncpy (device->scsi_device_strings.serial, inquiry_result->serial, 12);

	/*
	 * Get IEEE unique ID (FibreChannel WWN) from EVPD page 0x83
	 */
	inquiry_command->input_size = 0;
	inquiry_command->output_size = sizeof (struct extended_inquiry_result);
	inquiry_command->cmd[0] = 0x12;
	inquiry_command->cmd[1] = 0x01;
	inquiry_command->cmd[2] = 0x83;
	inquiry_command->cmd[3] = 0x00;
	inquiry_command->cmd[4] = 96;
	inquiry_command->cmd[5] = 0x00;

	extended_inquiry_result = ioctl_data;

	result = ioctl (fd, 1, inquiry_command);

	for (i = 0, j = 7; i < 8; i++, j--) {
		ieee_id_map.ieee_id_u8[j] = extended_inquiry_result->ieee_id[i];
	}

	device->ieee_id = ieee_id_map.ieee_id;

for (i = 0; i < 30; i++) {
	printf ("%02x,", extended_inquiry_result->ieee_id[i]);
}
printf ("\n\n");

	
	/*
	 * Get path to device
	 */
	result = ioctl (fd, SCSI_IOCTL_GET_IDLUN, &scsi_get_idlun);
	device->host = scsi_get_idlun.host;
	device->bus = scsi_get_idlun.bus;
	device->id = scsi_get_idlun.id;
	device->lun = scsi_get_idlun.lun;

	close (fd);

	unlink ("this");

	return (0);
}

int sd_major (int devno) {
	return (8);
}
int sd_minor (int devno) {
	return (16 * devno);
}

int g_md_minor = 255;

int get_md_minor (void) {
	return (g_md_minor--);
}

int configure_path (struct multipath *path)
{
	mdu_param_t mdu_p;
	mdu_version_t mdu_v;
	mdu_array_info_t mdu_a;
	mdu_disk_info_t mdu_d;
	int fd;
	int disk_fd;
	int result;
	char path_to_md[256];
	char path_to_disk[256];
	struct device_number *device_number;
	int i;
	int part;
	int disk_size;
	int md_minor;

	for (part = 0; part < 16; part++) {
		/*
		 * Skip when partition map bit not set
		 */
		if (part > 0 && ((path->scsi_device->partmap & (1 << part - 1)) == 0)) {
			continue;
		}
		md_minor = get_md_minor ();

		sprintf (path_to_md, "/dev/md/%d", md_minor);
		fd = open (path_to_md, O_RDONLY);

		result = ioctl (fd, RAID_VERSION, &mdu_v);

		if (part == 0) {
		sprintf (path_to_disk, "/dev/scsi/host%d/bus%d/target%d/lun%d/disc",
			path->scsi_device->host,
			path->scsi_device->bus,
			path->scsi_device->id,
			path->scsi_device->lun);
		} else {
		sprintf (path_to_disk, "/dev/scsi/host%d/bus%d/target%d/lun%d/part%d",
			path->scsi_device->host,
			path->scsi_device->bus,
			path->scsi_device->id,
			path->scsi_device->lun,
			part);
		}

		disk_fd = open (path_to_disk, O_RDONLY);
		result = ioctl (disk_fd, BLKGETSIZE, &disk_size);
		disk_size = disk_size / 2;
		close (disk_fd);
		
		mdu_a.active_disks = path->paths;
		mdu_a.working_disks = path->paths;
		mdu_a.level = -4;
		mdu_a.size = disk_size;
		mdu_a.raid_disks = path->paths;
		mdu_a.md_minor = md_minor;
		mdu_a.not_persistent = 1;
		mdu_a.state = 0;
		mdu_a.spare_disks = 0;
		mdu_a.failed_disks = path->paths;
		mdu_a.nr_disks = 2;
		mdu_a.layout = 0;
		mdu_a.chunk_size = 0;
		result = ioctl (fd, SET_ARRAY_INFO, &mdu_a);

		device_number = path->device_number_head;
		for (i = 0; i < path->paths; i++) {
			mdu_d.number = i;
			mdu_d.raid_disk = i;
			mdu_d.state = 6;
		mdu_d.major = sd_major (device_number->number);
		mdu_d.minor = sd_minor (device_number->number) + part;

		result = ioctl (fd, ADD_NEW_DISK, &mdu_d);
		device_number = device_number->next;
		}

		memset (&mdu_p, 0, sizeof (mdu_param_t));
		mdu_p.personality = -4;
		mdu_p.chunk_size = 0;

		result = ioctl (fd, RUN_ARRAY, &mdu_p);
		printf ("Multipath created: /dev/md/%d.\n", md_minor);
	}
}

int main (void)
{
struct scsi_device scsi_device_array[256];
struct multipath mp_table[256];
int i, j, next_loc, new_entry;
int result;
int scsi_device_count;
struct device_number *devno;
int print_device_list = 0;
int print_mp_list = 1;
DIR *dir;
off_t basep;
unsigned char buffer[1024];
char path_to_device[128];
struct dirent *dirent;
int part;
int part_count;
int fd;

	for (i = 0; i < 256; i++) {
		result = get_scsi_device (i, &scsi_device_array[j]);
		if (result == 0) {
			j++;
		}
	}

print_device_list = 1;
	if (print_device_list) {
		for (i = 0; i < j; i++) {
			printf ("device [%d] vendor [%s] model [%s] rev [%s] serial [%s]",
				i,
				scsi_device_array[i].scsi_device_strings.vendor,
				scsi_device_array[i].scsi_device_strings.model,
				scsi_device_array[i].scsi_device_strings.rev,
				scsi_device_array[i].scsi_device_strings.serial
			);
			if (scsi_device_array[i].ieee_id) {
				printf (" IEEE ID [%llx]\n", scsi_device_array[i].ieee_id);
			} else {
				printf ("\n");
			}
		}
	}

	scsi_device_count = j;

	/*
	 * Build multipath information
	 */
	for (next_loc = 0, i = 0; i < scsi_device_count; i++) {
		mp_table[next_loc].scsi_device = &scsi_device_array[i];
		mp_table[next_loc].device_number_head = 0;
		mp_table[next_loc].paths = 0;

		for (j = i; j < scsi_device_count; j++) {
			if (i == j) {
				continue;
			}
			if (memcmp (&scsi_device_array[i].scsi_device_strings, &scsi_device_array[j].scsi_device_strings, sizeof (struct scsi_device_strings)) == 0) {
				/*
				 * If this is the first found multiple path, create first link
				 */
				if (mp_table[next_loc].device_number_head == 0) {
					devno = (struct device_number *)malloc (sizeof (struct device_number));
					devno->next = 0;
					devno->number = i;
					mp_table[next_loc].device_number_head = devno;
					mp_table[next_loc].paths = 1;
				}

				/* * Create link for this new path
				 */
				devno = (struct device_number *)malloc (sizeof (struct device_number));
				devno->next = mp_table[next_loc].device_number_head;
				mp_table[next_loc].device_number_head = devno;
				devno->number = j;
				new_entry = 1;

				mp_table[next_loc].paths++;
			}
		}
		if (new_entry) {
			new_entry = 0;
			next_loc += 1;
		}
	}

	if (print_mp_list) {
		printf ("Multiple paths found:\n");
	}
	for (i = 0; i < next_loc; i++) {
		if (print_mp_list) {
			printf ("vendor [%s] model [%s] rev [%s] serial [%s]\n",
				scsi_device_array[i].scsi_device_strings.vendor,
				scsi_device_array[i].scsi_device_strings.model,
				scsi_device_array[i].scsi_device_strings.rev,
				scsi_device_array[i].scsi_device_strings.serial);
		}

		for (devno = mp_table[i].device_number_head; devno; devno = devno->next) {
			if (print_mp_list) {
				printf ("\tdevice no %d: /dev/scsi/host%d/bus%d/target%d/lun%d\t",
					devno->number,
					scsi_device_array[devno->number].host,
					scsi_device_array[devno->number].bus,
					scsi_device_array[devno->number].id,
					scsi_device_array[devno->number].lun);
			}

			sprintf (path_to_device, "/dev/scsi/host%d/bus%d/target%d/lun%d", 
				scsi_device_array[devno->number].host,
				scsi_device_array[devno->number].bus,
				scsi_device_array[devno->number].id,
				scsi_device_array[devno->number].lun);

			dir = opendir (path_to_device);
			part_count = 0;
			do {
				dirent = readdir (dir);
				if (dirent) {
					if (strncmp (dirent->d_name, "part", 4) == 0) {
						part = atoi (&dirent->d_name[4]);
						scsi_device_array[devno->number].partmap |= (1 << part - 1);
						part_count++;
					}
				}
			} while (dirent);
			closedir (dir);
			if (print_mp_list) {
				printf ("[1 disc, %d partitions]\n", part_count);
			}
		}

		printf ("\n");
	}

	for (i = 0; i < next_loc; i++) {
		configure_path (&mp_table[i]);
	}
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-05  3:54 James Bottomley
  2003-08-05 16:48 ` Alan Cox
@ 2003-08-06  0:14 ` Patrick Mansfield
  2003-08-06 20:26   ` Steven Dake
  2003-08-07 16:20 ` Christoph Hellwig
  2 siblings, 1 reply; 15+ messages in thread
From: Patrick Mansfield @ 2003-08-06  0:14 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List

James -

Thanks for the summary.

On Mon, Aug 04, 2003 at 08:54:55PM -0700, James Bottomley wrote:

> 1. Multi-path is relevant to more layers of the I/O stack than just
> SCSI. Thus, it makes sense to do it at the layer just above bio.  This
> would either be md/multipath or the Device Mapper multi-path module.

I was hoping for linux scsi to evolve into a "native queueing driver" [1],
adding multi-path to such a driver would be appropriate (of course IMO),
users of the native queueing driver would then get multi-path support.
(This is what I meant when referencing the "packet command interface" at
the SCSI BOF, sorry if the name made no sense, I thought there had been
earlier references to a common "packet interface" driver or such.)

Given the consensus for md/dm, I'm not planning any further work on a scsi
mid-level solution, though technically I prefer the mid-level approach.

One other issue discussed at the multi-path BOF is the lack of character
device (tape) support - dm does not work for such devices. (We do not need
a multi-ported tape device to see multi-path in linux, multiple
initiators on the same transport/bus/etc. also show up as multi-path).

Some other points following.

> 3.  It was noted that symmetric active multi-path in this scheme is not
> possible without the ability to place a proper elevator above the
> multi-pathing driver (and have a simple queue only noop elevator
> below).  This should help alleviate the current fragmentation issues
> where symmetric active multi-path produces I/O in decidedly non-optimal
> page sized chunks.

Related to queueing - we also need to queue commands (in dm) to avoid
sending too many commands to the actual device: dm should not send more
than scsi_device->queue_depth commands.

queue_depth changes via user (sysfs) or kernel space should eventually be
addressed (right now only one LLDD is using the scsi_track_queue_full).

We should eventually export scsi_host attributes (i.e. host_busy reached
can_queue limit, and host_blocked) such that dm can avoid congested or
blocked hosts.

We need to ensure that scsi_device fields (generally the per device state-like)
function properly when used with multi-path dm, including:

	access_count - probably OK with latest ref count changes, so a
	call to the release function by dm should remove a scsi_device (if
	scsi_remove_device was called on an active scsi_device), I don't
	know dm/md enough as to when/how it might release a path/device

	online - more below

	was_reset - probably OK, since it is somewhat path specific

	expecting_cc_ua - probably OK, same as was_reset

	device_blocked - QUEUE FULL was seen, we don't want commands
	on a given path to be starved out

	sdev_state - Mike's changes, I haven't looked at if/how it's
	affected relative to dm multi-path

For the online flag: on timeout, if we fast fail and do not try to recover
the device or transport, the device could be left online, and leave it to
dm to not send any further IO requests. This also might protect us from device
resets (other paths might have active IO). But this means a timeout might
take a dm path offline, and retrying on a separate path could offline all
paths to the device.

> infrastructure for us (in 2.6.0-test2).  The attached patch should add
> the fast fail capability to SCSI (although without the upwards/downwards
> failure indications) and we should be able to build the rest of the
> infrastructure on this framework.

What about a MEDIUM_ERROR - will all sectors be seen as completed with no
error for partial completion of IO (uptodate is 1 in scsi_end_request,
but your patch sets sectors = req->hard_nr_sectors)?

Per above the error handler (cmd timeout) should not requeue/retry if fast
fail is set (in scsi_eh_flush_done_q). And, should the error handler
recovery/resetting run for fast fail?

[1] http://marc.theaimsgroup.com/?l=linux-kernel&m=105400909207359&w=2

-- Patrick Mansfield

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-05 16:48 ` Alan Cox
@ 2003-08-05 17:06   ` James Bottomley
  2003-08-07 11:00     ` Alan Cox
  0 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2003-08-05 17:06 UTC (permalink / raw)
  To: Alan Cox; +Cc: SCSI Mailing List

On Tue, 2003-08-05 at 09:48, Alan Cox wrote:
> On Maw, 2003-08-05 at 04:54, James Bottomley wrote:
> > transport errors (relevant to multi-path) and medium errors (relevant to
> > software raid).
> 
> And multimedia..

Yes, multi-media probably wants all error indications.

> > 5. Vendor value add for specific devices could be encoded both as
> > configuration (udev) pieces and plug-ins to the upper layer multi-path
> > driver to activate any proprietary vendor specific configuration options
> > that may be needed for specific solutions.
> 
> Vendors will try and arrange that their hotplug only works between
> devices entirely of their form. We must be careful not to encourage them
> in abusing their userbase.

I'm going to try to walk the tightrope here.  I'd like to encourage
vendors to be open, but I know we're trying to bring a lot of long time
proprietary solutions on board, so I'd like to begin by giving them the
flexibility to bring their own particular value add to the table.  If I
fall off the tightrope, I'm sure someone will catch me...

James




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Summary of the Multi-Path BOF at OLS and future directions
  2003-08-05  3:54 James Bottomley
@ 2003-08-05 16:48 ` Alan Cox
  2003-08-05 17:06   ` James Bottomley
  2003-08-06  0:14 ` Patrick Mansfield
  2003-08-07 16:20 ` Christoph Hellwig
  2 siblings, 1 reply; 15+ messages in thread
From: Alan Cox @ 2003-08-05 16:48 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List

On Maw, 2003-08-05 at 04:54, James Bottomley wrote:
> transport errors (relevant to multi-path) and medium errors (relevant to
> software raid).

And multimedia..

> 5. Vendor value add for specific devices could be encoded both as
> configuration (udev) pieces and plug-ins to the upper layer multi-path
> driver to activate any proprietary vendor specific configuration options
> that may be needed for specific solutions.

Vendors will try and arrange that their hotplug only works between
devices entirely of their form. We must be careful not to encourage them
in abusing their userbase.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Summary of the Multi-Path BOF at OLS and future directions
@ 2003-08-05  3:54 James Bottomley
  2003-08-05 16:48 ` Alan Cox
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: James Bottomley @ 2003-08-05  3:54 UTC (permalink / raw)
  To: SCSI Mailing List

[-- Attachment #1: Type: text/plain, Size: 3690 bytes --]

Hi All,

For those of you who couldn't attend OLS, I thought a short summary of
what went on might be useful.

Multi-path was a hot topic throughout both the Kernel Summit and OLS.
Thing began with a requirement inputs panel of vendors identifying
multi-path as one of their primary problems.  Followed by an invited
discussion with Lars Marowski-Brée and Mike Anderson on multi-path. At
OLS, there was a paper presentation by Mike and Patrick Mansfield on the
IBM SCSI layer multi-pathing solution and finally there was the BOF
session which tried to pick a way forwards for us in 2.6/2.7

What I'd like to summarise is what I think the conclusions we reached
are:

1. Multi-path is relevant to more layers of the I/O stack than just
SCSI. Thus, it makes sense to do it at the layer just above bio.  This
would either be md/multipath or the Device Mapper multi-path module.

2. Doing multi-path at that level is not easy without fast failure
indications.

2a. On discussion of this, it was decided that on each bio/request, the
upper layers would like to indicate which failures they wish to be fast
and which they wish not to know about.  The two principle ones were
transport errors (relevant to multi-path) and medium errors (relevant to
software raid).

2b. Upwards, on fast failure, we would send back the raw sense data
(probably encoded in the sense request) plus a translated indication of
what the problem was.  The translations would probably be a combination
of (fatal|retryable) and (driver error (card out of
resources/failure)|transport error|medum error).

3.  It was noted that symmetric active multi-path in this scheme is not
possible without the ability to place a proper elevator above the
multi-pathing driver (and have a simple queue only noop elevator
below).  This should help alleviate the current fragmentation issues
where symmetric active multi-path produces I/O in decidedly non-optimal
page sized chunks.

4. Configuration of this solution would be extremely important.  The
idea here is to rely on the udev solution currently making its way into
the kernel and essentially have a vendor specific multi-path
configuration as a udev plug-in.

5. Vendor value add for specific devices could be encoded both as
configuration (udev) pieces and plug-ins to the upper layer multi-path
driver to activate any proprietary vendor specific configuration options
that may be needed for specific solutions.

6. Ownership.  This wasn't exactly discussed, but in light of the
problems with even SCSI-3 reservations, it is becoming clear that
storage ownership in a multi-path configuration is getting impossible to
maintain from user level.  Therefore, I at least will be giving thought
to an ownership API that could be used to manage storage ownership from
the kernel in the face of path fail overs.

As far as the beginnings of implementation go, we already have
md/multi-path.  Joe Thorber of Sistina will shortly be releasing the
code to do multi-path over the device mapper interface, and our trusty
block layer maintainer, Jens Axboe, has done the skeleton of a fast fail
infrastructure for us (in 2.6.0-test2).  The attached patch should add
the fast fail capability to SCSI (although without the upwards/downwards
failure indications) and we should be able to build the rest of the
infrastructure on this framework.

As far as errors and omissions go, I found KS/OLS to go rather fast and
be a bit blurry, so hopefully those who were also present can chime in
on this thread to amplify/correct the points I actually managed to grasp
and summarise the ones I missed.

Thanks,

James

[-- Attachment #2: tmp.diff --]
[-- Type: text/plain, Size: 1198 bytes --]

===== scsi_error.c 1.60 vs edited =====
--- 1.60/drivers/scsi/scsi_error.c	Thu Jul 31 07:32:18 2003
+++ edited/scsi_error.c	Mon Aug  4 14:20:24 2003
@@ -1285,7 +1285,12 @@

       maybe_retry:

-	if ((++scmd->retries) < scmd->allowed) {
+	/* we requeue for retry because the error was retryable, and
+	 * the request was not marked fast fail.  Note that above,
+	 * even if the request is marked fast fail, we still requeue
+	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
+	if ((++scmd->retries) < scmd->allowed 
+	    && !blk_noretry_request(scmd->request)) {
 		return NEEDS_RETRY;
 	} else {
 		/*
===== scsi_lib.c 1.108 vs edited =====
--- 1.108/drivers/scsi/scsi_lib.c	Sat Aug  2 10:18:20 2003
+++ edited/scsi_lib.c	Mon Aug  4 14:26:46 2003
@@ -497,6 +497,13 @@
 	struct request *req = cmd->request;
 	unsigned long flags;

+	/* If failfast is enabled, override the number of completed
+	 * sectors to make sure the entire request is finished right
+	 * now */
+	if(blk_noretry_request(req)) {
+		sectors = req->hard_nr_sectors;
+	}
+
 	/*
 	 * If there are blocks left over at the end, set up the command
 	 * to queue the remainder of them.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2003-08-08 13:25 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-08-08 12:13 Summary of the Multi-Path BOF at OLS and future directions jansen, frank
2003-08-08 12:15 ` Christoph Hellwig
2003-08-08 12:21 ` Josef Möllers
  -- strict thread matches above, loose matches on Subject: below --
2003-08-08 12:28 jansen, frank
2003-08-08 13:27 ` Josef Möllers
2003-08-05  3:54 James Bottomley
2003-08-05 16:48 ` Alan Cox
2003-08-05 17:06   ` James Bottomley
2003-08-07 11:00     ` Alan Cox
2003-08-06  0:14 ` Patrick Mansfield
2003-08-06 20:26   ` Steven Dake
2003-08-07  7:38     ` Lars Marowsky-Bree
2003-08-07 16:20 ` Christoph Hellwig
2003-08-07 23:54   ` Tim Pepper
2003-08-08  6:45   ` Josef Möllers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.