* Re: [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)]
       [not found] <20020907164631.GA17696@marowsky-bree.de>
@ 2002-09-07 19:59 ` Peter T. Breuer
  2002-09-07 20:27   ` Rik van Riel
                     ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-07 19:59 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
> as per your request, I am forwarding this mail to you again. The main point

Thanks.

> you'll find is that yes, I believe that your idea _can_ be made to work. Quite
> frankly, there are very few ideas which _cannot_ be made to work. The
> interesting question is whether it is worth it to go a particular route or not.
> 
> And let me say that I find it at least slightly rude to "miss" mail in a
> discussion; if you are so important that you get so much mail every day, maybe
> a public discussion on a mailing list isn't the proper way how to go about
> something...

Well, I'm sorry. I did explain that I am travelling, I think! And
it is even very hard to connect occasionally (it requires me finding 
an airport kiosk or an internet cafe), and then I have to _pay_ for the
time to compose a reply, and so on! Well, if I don't read your mail for
one day, then it will be filed somewhere for me by procmail, and I
haven't been able to check any filings ..

> > > *ouch* Sure. Right. You just have to read it from scratch every time. How
> > > would you make readdir work?
> > Well, one has to read it from scratch. I'll set about seeing how to do.
> > Clues welcome.
> 
> Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> PvFS etc.

Eh, I thought I saw this - didn't I reply?

> Any of them will appreciate the good work of a bright fellow.

Well, I know of some of these. Intermezzo I've tried lately and found
nearly impossible to set up and work with (still, a great improvement
over Coda, which was absolutely impossible, to within an atom's
breadth). And it hasn't got anywhere near the right orientation. Lustre
is something people have been pointing me at. What happened to Petal?

> No one appreciates reinventing the wheel another time, especially if - for
> simplification - it starts out as a square.

But what I suggest is finding a simple way to turn an existing FS into a 
distributed one. I.e. NOT reinventing the wheel. All those other people
are reinventing a wheel, for some reason :-).

> You tell me why Distributed Filesystems are important. I fully agree.
> 
> You fail to give a convincing reason why that must be made to work with
> "all" conventional filesystems, especially given the constraints this implies.

Because that's the simplest thing to do.

> Conventional wisdom seems to be that this can much better be handled specially
> by special filesystems, who can do finer grained locking etc because they
> understand the on disk structures, can do distributed journal recovery etc.

Well, how about allowing get_block to return an extra argument, which
is the ondisk placement of the inode(s) concerned, so that the vfs can
issue a lock request for them before the i/o starts. Let the FS return
the list of metadata things to lock, and maybe a semaphore to start the
i/o with.

There you are: instant distribution. It works for those fs's which
cooperate. Make sure the FS can indicate whether it replied or not.
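
Roughly the kind of hook I have in mind - an untested sketch, with all the
names (report_metadata, ptb_lock_block and the rest) invented on the spot:

/* Sketch only: the fs reports which on-disk metadata blocks an
 * operation will touch; the VFS then asks the disk's server for a
 * lock on each of them before the i/o is allowed to start.
 */
struct ptb_lock_list {
        int              nr;            /* number of metadata blocks  */
        unsigned long    blocknr[8];    /* their on-disk placement    */
        struct semaphore *start_io;     /* optional gate from the fs  */
};

static int ptb_prepare_io(struct inode *inode, long iblock, int create)
{
        struct ptb_lock_list ll = { .nr = 0, .start_io = NULL };
        int i, err;

        /* hypothetical extra super_operations method */
        if (inode->i_sb->s_op->report_metadata)
                inode->i_sb->s_op->report_metadata(inode, iblock, create, &ll);

        /* ask the server of the disk for each listed block */
        for (i = 0; i < ll.nr; i++) {
                err = ptb_lock_block(inode->i_sb, ll.blocknr[i]);
                if (err)
                        return err;     /* should unwind; omitted here */
        }
        if (ll.start_io)
                down(ll.start_io);      /* fs says when i/o may begin */
        return 0;
}

The matching unlock would go in the i/o completion path, symmetrically.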

> What you are starting would need at least 3-5 years to catch up with what
> people currently already can do, and they'll improve in this time too. 

Maybe 3-4 weeks more like. The discussion is helping me get a picture,
and when I'm back next week I'll try something. Then, unfortunately I
am away again from the 18th ...

> I've seen your academic track record and it is surely impressive. I am not

I didn't even know it was available anywhere! (or that it was
impressive - thank you).

> saying that your approach won't work within the constraints. Given enough
> thrust, pigs fly. I'm just saying that it would be nice to learn what reasons
> you have for this, because I believe that "within the constraints" makes your
> proposal essentially useless (see the other mails).
> 
> In particular, they make them useless for the requirements you seem to have. A
> petabyte filesystem without journaling? A petabyte filesystem with a single
> write lock? Gimme a break.

Journalling? Well, now you mention it, that would seem to be nice. But
my experience with journalling FS's so far tells me that they break
more horribly than normal.  Also, 1PB or so is the aggregate, not the
size of each FS on the local nodes. I don't think you can diagnose
"journalling" from the numbers. I am even rather loath to journal,
given what I have seen.


> Please, do the research and tell us what features you desire to have which are
> currently missing, and why implementing them essentially from scratch is

No features. Just take any FS that currently works, and see if you can
distribute it.  Get rid of all fancy features along the way.  You mean
"what's wrong with X"?  Well, it won't be mainstream, for a start, and
that's surely enough.  The projects involved are huge, and they need to
minimize risk, and maximize flexibility. This is CERN, by the way.


> preferrable to extending existing solutions.
> 
> You are dancing around all the hard parts. "Don't have a distributed lock
> manager, have one central lock." Yeah, right, has scaled _really_ well in the
> past. Then you figure this one out, and come up with a lock-bitmap on the
> device itself for locking subtrees of the fs. Next you are going to realize
> that a single block is not scalable either because one needs exclusive write

I made suggestions, hoping that the suggestions would elicit a response
of some kind. I need to explore as much as I can and get as much as I
can back without "doing it first", because I need the insight you can
offer.  I don't have the experience in this area, and I have the
experience to know that I need years of experience with that code to be
able to generate the semantics from scratch.  I'm happy with what I'm
getting.  I hope I'll be able to return soon with a trial patch.


> lock to it, 'cause you can't just rewrite a single bit. You might then begin
> to explore that a single bit won't cut it, because for recovery you'll need to
> be able to pinpoint all locks a node had and recover them. Then you might
> begin to think about the difficulties in distributed lock management and

There is no difficulty with that - there are no distributed locks. All
locks are held on the server of the disk (I decided not to be
complicated to begin with as a matter of principle early in life ;-).

> recovery. ("Transaction processing" is an exceptionally good book on that I
> believe)

Thanks but I don't feel like rolling it out and rolling it back!

> I bet you a dinner that what you are going to come up with will look
> frighteningly like one of the solutions which already exist; so why not

Maybe.

> research them first in depth and start working on the one you like most,
> instead of wasting time on an academic exercise?

Because I don't agree with your assessment of what I should waste my
time on. Though I'm happy to take it into account!

Maybe twenty years ago now I wrote my first disk based file system (for
functional databases) and I didn't like debugging it then! I positively
hate the thought of flattening trees and relating indices and pointers
now :-).

> > So, start thinking about general mechanisms to do distributed storage.
> > Not particular FS solutions.
> 
> Distributed storage needs a way to access it; in the Unix paradigm,
> "everything is a file", that implies a distributed filesystem. Other
> approaches would include accessing raw blocks and doing the locking in the
> application / via a DLM (ie, what Oracle RAC does).

Yep, but we want "normality", just normality writ a bit larger than
normal.

>     Lars Marowsky-Brée <lmb@suse.de>

Thanks for the input. I don't know what I was supposed to take away
from it though!

Peter


* Re: [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)]
  2002-09-07 19:59 ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Peter T. Breuer
@ 2002-09-07 20:27   ` Rik van Riel
  2002-09-07 21:14   ` [RFC] mount flag "direct" Lars Marowsky-Bree
  2002-09-07 23:18   ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Andreas Dilger
  2 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2002-09-07 20:27 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Lars Marowsky-Bree, linux kernel

On Sat, 7 Sep 2002, Peter T. Breuer wrote:

> > No one appreciates reinventing the wheel another time, especially if - for
> > simplification - it starts out as a square.
>
> But what I suggest is finding a simple way to turn an existing FS into a
> distributed one. I.e. NOT reinventing the wheel. All those other people
> are reinventing a wheel, for some reason :-).

To stick with the wheel analogy, while bicycle wheels will
fit on a 40-tonne truck, they won't even get you out of the
parking lot.

> > You tell me why Distributed Filesystems are important. I fully agree.
> >
> > You fail to give a convincing reason why that must be made to work with
> > "all" conventional filesystems, especially given the constraints this implies.
>
> Because that's the simplest thing to do.

You've already admitted that you would need to modify the
existing filesystems in order to create "filesystem independent"
clustered filesystem functionality.

If you're modifying filesystems, surely it no longer is filesystem
independent and you might as well design your filesystem so it can
do clustering in an _efficient_ way.

> > What you are starting would need at least 3-5 years to catch up with what
> > people currently already can do, and they'll improve in this time too.
>
> Maybe 3-4 weeks more like. The discussion is helping me get a picture,
> and when I'm back next week I'll try something. Then, unfortunately I
> am away again from the 18th ...

If you'd only spent 3-4 _days_ looking at clustered filesystems
you would see that it'd take months or years to get something
working decently.


> No features. Just take any FS that currently works, and see if you can
> distribute it.  Get rid of all fancy features along the way.  You mean
> "what's wrong with X"?  Well, it won't be mainstream, for a start, and
> that's surely enough.  The projects involved are huge, and they need to
> minimize risk, and maximize flexibility. This is CERN, by the way.

All you can hope for now is that CERN doesn't care about data
integrity or performance ;)

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org



* Re: [RFC] mount flag "direct"
  2002-09-07 19:59 ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Peter T. Breuer
  2002-09-07 20:27   ` Rik van Riel
@ 2002-09-07 21:14   ` Lars Marowsky-Bree
  2002-09-08  9:23     ` Peter T. Breuer
  2002-09-07 23:18   ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Andreas Dilger
  2 siblings, 1 reply; 31+ messages in thread
From: Lars Marowsky-Bree @ 2002-09-07 21:14 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux kernel

On 2002-09-07T21:59:20,
   "Peter T. Breuer" <ptb@it.uc3m.es> said:

> > Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> > OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> > PvFS etc.
> Eh, I thought I saw this - didn't I reply?

No, you didn't.

> > No one appreciates reinventing the wheel another time, especially if - for
> > simplification - it starts out as a square.
> But what I suggest is finding a simple way to turn an existing FS into a 
> distributed one. I.e. NOT reinventing the wheel. All those other people
> are reinventing a wheel, for some reason :-).

Well, actually they aren't exactly. The hard part in a "distributed
filesystem" isn't the filesystem itself; while it is very necessary of course.
The locking, synchronization and cluster infrastructure is where the real
difficulty tends to arise.

Yes, it can be argued whether it is in fact easier to create a filesystem from
scratch with clustering in mind (so it is "optimised" for being able to do
fine-grained locking etc), or whether to put a generic clustering layer on
top of existing ones.

The guesstimates of those involved in the past seem to suggest that the
first is the case. And I also tend to think this is the case, but I've been
wrong before.

That would - indeed - be very helpful research to do. I would start by
comparing the places where those specialized fs's actually are doing cluster
related stuff and checking whether it can be abstracted, generalized and
improved. In any case, trying to pick apart OpenGFS for example will give
you more insight into the problem area than a discussion on l-k.

If you want to look into "turn a local fs into a cluster fs", SGI has a
"clustered XFS"; however I'm not too sure how public that extension is. The
hooks might be in the common XFS core, though.

Now, going on with the gedankenexperiment, given a distributed lock manager
(IBM open-sourced one of theirs, though it is not currently perfectly working
;), the locking primitives in the filesystems could "simply" be changed from
local-node SMP spinlocks to cluster-wide locks.

That _should_ to a large degree take care of the locking.
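
Very roughly, the substitution I mean is this - a pure sketch, with
dlm_lock()/dlm_unlock() standing in for whatever the lock manager really
exports, and glossing over the fact that a real cluster lock call is slow
and can fail, which a spinlock never does:

struct cluster_lock {
        spinlock_t      local;          /* still needed for local SMP  */
        u64             resource;       /* cluster-wide lock name      */
};

static inline void cluster_lock(struct cluster_lock *cl, int mode)
{
        dlm_lock(cl->resource, mode);   /* shared or exclusive; blocks
                                           until the cluster grants it */
        spin_lock(&cl->local);
}

static inline void cluster_unlock(struct cluster_lock *cl)
{
        spin_unlock(&cl->local);
        dlm_unlock(cl->resource);
}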

What remains is the invalidation of cache pages; I would expect similar
problems must have arisen in NC-NUMA style systems, so looking there should
provide hints.

> > You fail to give a convincing reason why that must be made to work with
> > "all" conventional filesystems, especially given the constraints this
> > implies.
> Because that's the simplest thing to do.

Why? I disagree.

You will have to modify existing file systems quite a bit to work
_efficiently_ in a cluster environment; not even the on-disk layout is
guaranteed to stay consistent as soon as you add per-node journals etc. The
real complexity is in the distributed nature, in particular the recovery (see
below).

"Simplest thing to do" might be to take your funding and give it to the
OpenGFS group or have someone fix the Oracle Cluster FS.

> > In particular, they make them useless for the requirements you seem to
> > have. A petabyte filesystem without journaling? A petabyte filesystem with
> > a single write lock? Gimme a break.
> Journalling? Well, now you mention it, that would seem to be nice.

"Nice" ? ;-) You gotta be kidding. If you don't have journaling, distributed
recovery becomes near impossible - at least I don't have a good idea on how to
do it if you don't know what the node had been working on prior to its
failure.

If "take down the entire filesystem on all nodes, run fsck" is your answer to
that, I will start laughing in your face. Because then your requirements are
kind of from outer space and will certainly not reflect a large part of the
user base.

> > Please, do the research and tell us what features you desire to have which
> > are currently missing, and why implementing them essentially from scratch
> > is
> No features.

So they implement what you need, but you don't like them because there are just
so few of them to choose from? Interesting.

> Just take any FS that currently works, and see if you can distribute it.
> Get rid of all fancy features along the way.  The projects involved are
> huge, and they need to minimize risk, and maximize flexibility. This is
> CERN, by the way.

Well, you are taking quite a risk trying to run a
not-aimed-at-distributed-environments fs and trying to make it distributed by
force. I _believe_ that you are missing where the real trouble lurks.

You maximize flexibility for mediocre solutions; little caching, no journaling
etc.

What does this supposed "flexibility" buy you? Is there any real value in it
or is it a "because!" ?

> You mean "what's wrong with X"? Well, it won't be mainstream, for a start,
> and that's surely enough.

I have pulled these two sentences out because I don't get them. What "X" are
you referring to?

> of some kind. I need to explore as much as I can and get as much as I
> can back without "doing it first", because I need the insight you can
> offer.

The insight I can offer you is to look at OpenGFS, see and understand what it
does, why and how. Then try to come up with a generic approach on how to put
this on top of a generic filesystem, without making it useless.

Then I shall be amazed.

> There is no difficulty with that - there are no distributed locks. All locks
> are held on the server of the disk (I decided not to be complicated to
> begin with as a matter of principle early in life ;-).

Maybe you and I have a different idea of "distributed fs". I thought you had a
central pool of disks.

You want there to be local disks at each server, and other nodes can read
locally and have it appear as a big, single filesystem? You'll still have to
deal with node failure though.

Interesting. 

One might consider peeling apart meta-data (which always goes through the
"home" node) and data (which goes directly to disk via the SAN); if necessary,
the reply to the meta-data request to the home node could tell the node where
to write/read. This smells a lot like cXFS and co with a central metadata
server.
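
In message terms the split would look something like this (invented names,
only to illustrate): the metadata request goes to the home node, and the
reply carries the extents so the client can do the data i/o itself over the
SAN.

struct md_request {
        u64     ino;            /* object the client wants to touch    */
        u64     offset;         /* where in the file                   */
        u32     len;
        u32     op;             /* e.g. MD_READ, MD_WRITE_ALLOC        */
};

struct md_reply {
        u32     err;
        u32     nr_extents;
        struct md_extent {
                u64     disk_block;     /* placement on the shared disk */
                u32     nr_blocks;
        } extent[0];            /* client does this i/o directly        */
};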

> > recovery. ("Transaction processing" is an exceptionally good book on that
> > I believe)
> Thanks but I don't feel like rolling it out and rolling it back!

Please explain how you'll recover anywhere close to "fast" or even
"acceptable" without transactions. Even if you don't have to fsck the petabyte
filesystem completely, do a benchmark on how long e2fsck takes on, oh, 50GB
only.

> Thanks for the input. I don't know what I was supposed to take away
> from it though!

I apologize and am sorry if you didn't notice.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister



* Re: [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)]
  2002-09-07 19:59 ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Peter T. Breuer
  2002-09-07 20:27   ` Rik van Riel
  2002-09-07 21:14   ` [RFC] mount flag "direct" Lars Marowsky-Bree
@ 2002-09-07 23:18   ` Andreas Dilger
  2 siblings, 0 replies; 31+ messages in thread
From: Andreas Dilger @ 2002-09-07 23:18 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Lars Marowsky-Bree, linux kernel

On Sep 07, 2002  21:59 +0200, Peter T. Breuer wrote:
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> > OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> > PvFS etc.
> > Any of them will appreciate the good work of a bright fellow.
> 
> Well, I know of some of these. Intermezzo I've tried lately and found
> nearly impossible to set up and work with (still, a great improvement
> over Coda, which was absolutely impossible, to within an atom's
> breadth). And it hasn't got anywhere near the right orientation. Lustre
> is something people have been pointing me at. What happened to Petal?

Well, Intermezzo has _some_ of what you are looking for, but isn't
really geared to your needs.  It is a distributed _replicated_
filesystem, so it doesn't necessarily scale as well as possible for
the many-nodes-writing case.

Lustre is actually very close to what you are talking about, which I
have mentioned a couple of times before.  It has distributed storage,
so each node could write to its own disk, but it also has a coherent
namespace across all client nodes (clients can also be data servers),
so that all files are accessible from all clients over the network.

It is designed with high performance networking in mind (Quadrics Elan
is what we are working with now) which supports remote DMA and such.

> But what I suggest is finding a simple way to turn an existing FS into a 
> distributed one. I.e. NOT reinventing the wheel. All those other people
> are reinventing a wheel, for some reason :-).

We are not re-inventing the on-disk filesystem, only adding the layers
on top (networking, locking) which is absolutely _required_ if you are
going to have a distributed filesystem.  The locking is distributed,
in the sense that there is one node in charge of the filesystem layout
(metadata server, MDS) and it is the lock manager there, but all of the
storage nodes (called object storage targets, OST) are in charge of locking
(and block allocation and such) for the files stored there.

You can't tell me you are going to have a distributed network filesystem
that does not have network support or locking.

> Well, how about allowing get_block to return an extra argument, which
> is the ondisk placement of the inode(s) concerned, so that the vfs can
> issue a lock request for them before the i/o starts. Let the FS return
> the list of metadata things to lock, and maybe a semaphore to start the
> i/o with.

In fact, you can go one better (as Lustre does) - the layout of the data
blocks for a file is totally internal to the storage node.  The clients
only deal with object IDs (inode numbers, essentially) and offsets in
that file.  How the OST filesystem lays out the data in that object is
not visible to the clients at all, so no need to lock the whole
filesystem across nodes to do the allocation.  The clients do not write
directly to the OST block device EVER.
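
To show the shape of the interface (these are not the real Lustre
structures, just the idea): the client's view of an object is an id and an
offset, and block numbers never appear on the wire.

struct ost_io {
        u64     objid;          /* object id, roughly an inode number */
        u64     offset;         /* byte offset within the object      */
        u32     len;
        void    *buf;
};

/* client side: no block numbers anywhere in sight */
int ost_write(struct ost_io *io);

/* OST side: the local fs does the block allocation as it always has;
 * ost_object_to_file() and ost_file_write() are invented helpers */
int ost_handle_write(struct ost_io *io)
{
        struct file *filp = ost_object_to_file(io->objid);

        /* an ordinary write into the backing fs; ext3 (or whatever)
         * picks the blocks, and the client never knows or cares */
        return ost_file_write(filp, io->buf, io->len, io->offset);
}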

> > What you are starting would need at least 3-5 years to catch up with what
> > people currently already can do, and they'll improve in this time too. 
> 
> Maybe 3-4 weeks more like.

LOL.  This has been my full time job for the last 6 months, and I'm just a babe
in the woods.  Sure, you could do _something_, but nothing that would get
the performance you want.

> > A petabyte filesystem without journaling? A petabyte filesystem with a
> > single write lock? Gimme a break.
> 
> Journalling? Well, now you mention it, that would seem to be nice. But
> my experience with journalling FS's so far tells me that they break
> more horribly than normal.  Also, 1PB or so is the aggregate, not the
> size of each FS on the local nodes. I don't think you can diagnose
> "journalling" from the numbers. I am even rather loath to journal,
> given what I have seen.

Lustre is what you describe - dozens (hundreds, thousands?) of independent
storage targets, each controlling part of the total storage space.
Even so, journaling is crucial for recovery of metadata state, and
coherency between the clients and the servers, unless you don't think
that hardware or networks ever fail.  Even with distributed storage,
a PB is 1024 nodes with 1TB of storage each, and that will still take
a long time to fsck just one client, let alone return to filesystem
wide coherency.

> No features. Just take any FS that currently works, and see if you can
> distribute it.  Get rid of all fancy features along the way.  You mean
> "what's wrong with X"?

Again, you are preaching to the choir here.  In principle, Lustre does
what you want, but not with the "one big lock for the whole system"
approach, and it doesn't intrude into the VFS or need no-cache operation
because the clients DO NOT write directly onto the block device on the
OST.  They DO communicate directly with the OST (so you have basically
linear I/O bandwidth scaling with OSTs and clients), but the OST uses
the normal VFS/filesystem to handle block allocation internally.

> Well, it won't be mainstream, for a start, and
> that's surely enough.  The projects involved are huge, and they need to
> minimize risk, and maximize flexibility. This is CERN, by the way.

We are working on the filesystem for MCR (http://www.llnl.gov/linux/mcr/),
a _very large_ cluster at LLNL, 1000 4way 2.4GHz P4 client nodes, 100 TB
of disk, etc.  (as an aside, sys_statfs wraps at 16TB for one filesystem,
I already saw, but I think I can work around it... ;-)

> There is no difficulty with that - there are no distributed locks. All
> locks are held on the server of the disk (I decided not to be
> complicated to begin with as a matter of principle early in life ;-).

See above.  Even if the server holds all of the locks for its "area" (as
we are doing) you are still "distributing" the locks to the clients when
they want to do things.  The server still has to revoke those locks when
another client wants them, or your application ends up doing something
similar.
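
Schematically, the client end of such a revoke looks something like this
(lock_msg, find_locked_inode and send_ack are invented names; the flush
calls are roughly the 2.4 ones, from memory):

static int handle_lock_revoke(struct lock_msg *msg)
{
        struct inode *inode = find_locked_inode(msg->ino); /* takes a ref */

        if (!inode)
                return send_ack(msg);           /* nothing cached here */

        filemap_fdatasync(inode->i_mapping);    /* flush dirty pages   */
        filemap_fdatawait(inode->i_mapping);
        invalidate_inode_pages(inode);          /* drop the cache      */
        iput(inode);
        return send_ack(msg);                   /* server may re-grant */
}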

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/



* Re: [RFC] mount flag "direct"
  2002-09-07 21:14   ` [RFC] mount flag "direct" Lars Marowsky-Bree
@ 2002-09-08  9:23     ` Peter T. Breuer
  2002-09-08  9:59       ` Lars Marowsky-Bree
  0 siblings, 1 reply; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-08  9:23 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
> > > In particular, they make them useless for the requirements you seem to
> > > have. A petabyte filesystem without journaling? A petabyte filesystem with
> > > a single write lock? Gimme a break.
> > Journalling? Well, now you mention it, that would seem to be nice.
> 
> "Nice" ? ;-) You gotta be kidding. If you don't have journaling, distributed
> recovery becomes near impossible - at least I don't have a good idea on how to

It's OK. The calculations are duplicated and the FS's are too. The
calculation is highly parallel.

> do it if you don't know what the node had been working on prior to its
> failure.

Yes we do. Its place in the topology of the network dictates what it was
working on, and anyway that's just a standard parallelism "barrier"
problem.

> Well, you are taking quite a risk trying to run a
> not-aimed-at-distributed-environments fs and trying to make it distributed by
> force. I _believe_ that you are missing where the real trouble lurks.

There is no risk, because, as you say, we can always use nfs or another
off the shelf solution. But 10% better is 10% more experiment for
each timeslot for each group of investigators.

> What does this supposed "flexibility" buy you? Is there any real value in it

Ask the people who might scream for 10% more experiment in their 2
weeks.

> > You mean "what's wrong with X"? Well, it won't be mainstream, for a start,
> > and that's surely enough.
> 
> I have pulled these two sentences out because I don't get them. What "X" are
> you referring to?

Any X that is not a standard FS. Yes, I agree, not exact.

> The insight I can offer you is look at OpenGFS, see and understand what it
> does, why and how. The try to come up with a generic approach on how to put
> this on top of a generic filesystem, without making it useless.
> 
> Then I shall be amazed.

I have to catch a plane ..

Peter


* Re: [RFC] mount flag "direct"
  2002-09-08  9:23     ` Peter T. Breuer
@ 2002-09-08  9:59       ` Lars Marowsky-Bree
  2002-09-08 16:46         ` Peter T. Breuer
  0 siblings, 1 reply; 31+ messages in thread
From: Lars Marowsky-Bree @ 2002-09-08  9:59 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux kernel

On 2002-09-08T11:23:39,
   "Peter T. Breuer" <ptb@it.uc3m.es> said:

> > do it if you don't know what the node had been working on prior to its
> > failure.
> Yes we do. Its place in the topology of the network dictates what it was
> working on, and anyway that's just a standard parallelism "barrier"
> problem.

I meant wrt what it had been working on in the filesystem. You'll need to do a
full fsck locally if it isn't journaled. Oh well.

Maybe it would help if you outlined your architecture as you see it right now.

> > Well, you are taking quite a risk trying to run a
> > not-aimed-at-distributed-environments fs and trying to make it distributed
> > by force. I _believe_ that you are missing where the real trouble lurks.
> There is no risk, because, as you say, we can always use nfs or another off
> the shelf solution. 

Oh, so the discussion is a purely academic mind experiment; it would have been
helpful if you told us in the beginning.

> But 10% better is 10% more experiment for each timeslot
> for each group of investigators.

> > What does this supposed "flexibility" buy you? Is there any real value in
> > it
> Ask the people who might scream for 10% more experiment in their 2 weeks.

> > > You mean "what's wrong with X"? Well, it won't be mainstream, for a start,
> > > and that's surely enough.
> > I have pulled these two sentences out because I don't get them. What "X" are
> > you referring to?
> Any X that is not a standard FS. Yes, I agree, not exact.

So, your extensions are going to be "more" mainstream than OpenGFS / OCFS etc?
What the hell have you been smoking?

It has become apparent in the discussion that you are optimizing for a very
rare special case. OpenGFS, Lustre etc at least try to remain useable for
generic filesystem operation.

"It won't be mainstream" applies to _your_ approach, not to those
"off the shelf" solutions.

And your special "optimisations" (like, no caching, no journaling...) are
supposed to be 10% _faster_ overall than these which are - to a certain extent
- from the ground up optimised for this case?

One of us isn't listening while clue is knocking. 

Now it might be me, but then I apologize for having wasted your time and will
stand corrected as soon as you have produced working code.

Until then, have fun. I feel like I am wasting both your and my time, and this
isn't strictly necessary.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister



* Re: [RFC] mount flag "direct"
  2002-09-08  9:59       ` Lars Marowsky-Bree
@ 2002-09-08 16:46         ` Peter T. Breuer
  0 siblings, 0 replies; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-08 16:46 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
> On 2002-09-08T11:23:39,
>    "Peter T. Breuer" <ptb@it.uc3m.es> said:
> 
> > > do it if you don't know what the node had been working on prior to its
> > > failure.
> > Yes we do. Its place in the topology of the network dictates what it was
> > working on, and anyway that's just a standard parallelism "barrier"
> > problem.
> 
> I meant wrt what it had been working on in the filesystem. You'll need to do a
> full fsck locally if it isn't journaled. Oh well.

Well, something like that anyway.

> Maybe it would help if you outlined your architecture as you see it right now.

I did in another post, I think.  A torus with local 4-way direct
connectivity with each node connected to three neighbours and exporting
one local resource and importing three more from neighbours.  All
shared.  Add raid to taste.

> > There is no risk, because, as you say, we can always use nfs or another off
> > the shelf solution. 
> 
> Oh, so the discussion is a purely academic mind experiment; it would have been

Puhleeese try not to go off the deep end at an innocent observation.
Take the novocaine or something. I am just pointing out that there are
obvious safe fallbacks, AND ...

> helpful if you told us in the beginning.
> 
> > But 10% better is 10% more experiment for each timeslot
> > for each group of investigators.

You see?

> > > you referring to?
> > Any X that is not a standard FS. Yes, I agree, not exact.
> 
> So, your extensions are going to be "more" mainstream than OpenGFS / OCFS etc?

Quite possibly/probably. Let's see how it goes, shall we?
Do you want to shoot down returning the index of the inode in get_block
in order that we can do a wlock on that index before the i/o to
the file takes place? Not sufficient in itself, but enough to be going
on with, and enough for FS's that are reasonable in what they do.
Then we need to drop the dcache entry nonlocally.
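
Something on the order of this untested sketch, say - how the message
arrives and how the parent dentry gets found are the invented parts;
d_lookup/d_drop/dput are the real dcache calls:

/* Peer says "name <parent,name> changed under you": unhash whatever
 * we have cached for it, so the next lookup goes back to the disk.
 */
static void ptb_drop_remote_dentry(struct dentry *parent, struct qstr *name)
{
        struct dentry *dentry = d_lookup(parent, name); /* cached only */

        if (dentry) {
                d_drop(dentry);         /* next lookup rereads the disk */
                dput(dentry);
        }
}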

> What the hell have you been smoking?

Unfortunately nothing at all, let alone worthwhile. 

> It has become apparent in the discussion that you are optimizing for a very

To you, perhaps, not to me. What I am thinking about is a data analysis
farm, handling about 20GB/s of input data in real time, with numbers
of nodes measured in the thousands, and network-raided internally. Well,
you'd need a thousand nodes on the first ring alone just to stream
to disk at 20MB/s per node, and that will generate three to six times
that amount of internal traffic just from the raid. So aggregate
bandwidth in the first analysis ring has to be of the order of 100GB/s.
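(That is: 20GB/s landed at 20MB/s per node needs the thousand nodes; a raid
replication factor of three to six turns that 20GB/s into 60-120GB/s of
internal traffic, hence the order-of-100GB/s figure.)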

If the needs are special, it's because of the magnitude of the numbers,
not because of any special quality.

> rare special case. OpenGFS, Lustre etc at least try to remain useable for
> generic filesystem operation.
> 
> "It won't be mainstream" applies to _your_ approach, not to those
> "off the shelf" solutions.

I'm willing to look at everything.

> And your special "optimisations" (like, no caching, no journaling...) are
> supposed to be 10% _faster_ overall than these which are - to a certain extent

Yep.  Caching looks irrelevant because we read once and write once, by
and large.  You could argue that we write once and read once, which
would make caching sensible, but the data streams are so large as to make
it likely that caches would be flooded out anyway.  Buffering would be
irrelevant except inasmuch as it allows for asynchronous operation.

And the network is so involved in this that I would really like to get
rid of the current VMS however I could (it causes pulsing behaviour,
which is most disagreeable).

> - from the ground up optimised for this case?
> 
> One of us isn't listening while clue is knocking. 

You have an interesting bedtime story manner.

> Now it might be me, but then I apologize for having wasted your time and will
> stand corrected as soon as you have produced working code.

Shrug.

> Until then, have fun. I feel like I am wasting both your and my time, and this
> isn't strictly necessary.

!!

There's no argument. I'm simply looking for entry points to the code. 
I've got a lot of good information, especially from Anton (and other
people!), that I can use straight off. My thanks for the insights.

Peter


* Re: [RFC] mount flag "direct"
  2002-09-04  9:15                   ` Peter T. Breuer
@ 2002-09-04 11:34                     ` Helge Hafting
  0 siblings, 0 replies; 31+ messages in thread
From: Helge Hafting @ 2002-09-04 11:34 UTC (permalink / raw)
  To: ptb; +Cc: linux-kernel

"Peter T. Breuer" wrote:

> There is no problem locking and serializing groups of
> read/write accesses.  Please stop harping on about THAT at
> least :-). What is a problem is marking the groups of accesses.

Sorry, I now see you dealt with that in other threads.
> 
> That's fine. And I don't see what needs to be reread. You had this
> problem once with smp, and you beat it with locks.
> 
Consider that taking a lock on an SMP machine is a fairly fast
operation.  Taking a lock shared over a network probably
takes about 100-1000 times as long.

People submit patches for shaving a single instruction
off the SMP locks, for performance.  The locking is removed
on UP, because it makes a difference even though the
lock never is busy in the UP case.  A much slower lock will
either hurt performance a lot, or force a coarse granularity.
The time spent on locking had better be a small fraction
of total time, or you won't get your high performance.
A coarse granularity will limit your software so the
different machines mostly use different parts of the 
shared disks, or you'll lose the parallelism.
I guess that is fine with you then.


> > it is useless for everything, although it certainly is useless
> > for the purposes I can come up with.  The only uses *I* find
> > for a shared writeable (but uncachable) disk are so special that
> > I wouldn't bother putting a fs like ext2 on it.  
> 
> It's far too inconvenient to be totally without a FS. What we
> want is a normal FS, but slower at some things, and faster at others,
> but correct and shared. It's an approach. The calculations show
> clearly that r/w  (once!) to existing files are the only performance
> issues. The rest is decor. But decor that is very nice to have around.

Ok.  If r/w _once_ is what matters, then surely you don't
need cache.  I consider that a rather unusual case though,
which is why you'll have a hard time getting this into
the standard kernel.  But maybe you don't need that?

Still, you should consider writing a fs of your own.
It is a _small_ job compared to implementing your locking
system in existing filesystems.  Remember that those
filesystems are optimized for a common case of a few
cpu's, where you may take and release hundreds or 
thousands of locks per second, and where data transfers
often are small and repetitive.  Caching is so
useful for this case that current fs code is designed
around it.

With a fs of your own you won't have to worry about
maintainers changing the rest of the fs
code.  That sort of thing is hard to keep up with,
given the massive changes you'll need for your sort
of distributed fs.  A single-purpose fs isn't such a
big job, you can leave out design considerations
that don't apply to your case.  

Helge Hafting


* Re: [RFC] mount flag "direct"
  2002-09-04  6:49                 ` Helge Hafting
@ 2002-09-04  9:15                   ` Peter T. Breuer
  2002-09-04 11:34                     ` Helge Hafting
  0 siblings, 1 reply; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-04  9:15 UTC (permalink / raw)
  To: Helge Hafting; +Cc: ptb, linux kernel

"A month of sundays ago Helge Hafting wrote:"
> No problem if all you do is use file data.  A serious problem if
> the stuff you read is used to make a decision about where
> to write something else on that shared disk.  For example:
> The fs needs to extend a file.  It reads the free block bitmap,
> and finds a free block.  Then it overwrites that free block,
> and also writes back a changed block bitmap.  Unfortunately

That's the exact problem that's already been mentioned twice,
and I'm confident of that one being solved. Lock the whole
FS if necessary, but read the bitmap and lock the bitmap on disk
until the extension is finished and the bitmap is written back.
It has been suggested that the VFS support a "reserve/release blocks"
operation. It would simply mark the ondisk bitmap bits as used
and add them to our available list. Then every file extension
or creation would need to be preceded by a reserve command,
or fail, according to policy.
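
Something like this is what I imagine for it (invented names, no patch yet):

/* Sketch of the "reserve blocks" idea: the server-side fs marks the
 * bits used in the on-disk bitmap under its own lock and hands us a
 * private range; new allocations then come only out of that reserve.
 */
struct block_reserve {
        unsigned long   start;          /* first block handed to us    */
        unsigned long   count;          /* how many we may use         */
        unsigned long   used;
};

static int reserve_blocks(struct super_block *sb, unsigned long want,
                          struct block_reserve *res)
{
        /* hypothetical super_operations hook: lock the bitmap on the
         * disk's server, mark 'want' blocks in use, write the bitmap
         * back, and tell us which range we were given */
        if (!sb->s_op->reserve_blocks)
                return -EOPNOTSUPP;
        res->used = 0;
        return sb->s_op->reserve_blocks(sb, want, &res->start, &res->count);
}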

> some other machine just did the same thing and you
> now have a crosslinked and corrupt file.

There is no problem locking and serializing groups of
read/write accesses.  Please stop harping on about THAT at
least :-). What is a problem is marking the groups of accesses.

> There are several similar scenarios. You can't really talk
> about "not caching".  Once you read something into
> memory it is "cached" in memory, even if you only use it once
> and then re-read it whenever you need it later.

That's fine. And I don't see what needs to be reread. You had this
problem once with smp, and you beat it with locks.

> > A generic mechanism is not a "no cache fs". It's a generic mechanism.
> > 
> > > Nobody will have time to wait for this, and this alone makes your
> > 
> > Try arguing logically. I really don't like it when people invent their
> > own straw men and then proceed to reason as though it were *mine*.
> > 
> Maybe I wasn't clear.  What I say is that a fs that doesn't cache
> anything in order to avoid cache coherency problems will be
> too slow for generic use.  (Such as two desktop computers

Quite possibly, but not too slow for reading data in and writing data
out, at gigabyte/s rates overall, which is what the intention is.
That's not general use. And even if it were general use, it would still
be pretty acceptable _in general_.

> > Then imagine some more. I'm not responsible for your imagination ...
> 
> You tell.  You keep asking why your idea won't work and I
> give you "performance problems" _even_ if you sort out the
> correctness issues with no other cost than the lack of cache.

The correctness issues are the only important ones, once we have
correct and fast shared read and write to (existing) files.

> it is useless for everything, although it certainly is useless
> for the purposes I can come up with.  The only uses *I* find
> for a shared writeable (but uncachable) disk are so special that
> I wouldn't bother putting a fs like ext2 on it.  Sharing a
> raw block device is doable today if you let the programs

It's far too inconvenient to be totally without a FS. What we
want is a normal FS, but slower at some things, and faster at others,
but correct and shared. It's an approach. The calculations show
clearly that r/w  (once!) to existing files are the only performance
issues. The rest is decor. But decor that is very nice to have around.

Peter


* Re: [RFC] mount flag "direct"
  2002-09-04  6:21               ` Peter T. Breuer
@ 2002-09-04  6:49                 ` Helge Hafting
  2002-09-04  9:15                   ` Peter T. Breuer
  0 siblings, 1 reply; 31+ messages in thread
From: Helge Hafting @ 2002-09-04  6:49 UTC (permalink / raw)
  To: ptb; +Cc: linux kernel

"Peter T. Breuer" wrote:
> 
> "A month of sundays ago Helge Hafting wrote:"
> > "Peter T. Breuer" wrote:
> > > "A month of sundays ago David Lang wrote:"
> > > > Peter, the thing that you seem to be missing is that direct mode only
> > > > works for writes, it doesn't force a filesystem to go to the hardware for
> > > > reads.
> > >
> > > Yes it does. I've checked! Well, at least I've checked that writing
> > > then reading causes the reads to get to the device driver. I haven't
> > > checked what reading twice does.
> >
> > You tried reading from a file?  For how long are you going to
> 
> Yes I did. And I tried reading twice too, and it reads twice at device
> level.
> 
> > work on that data you read?  The other machine may ruin it anytime,
> 
> Well, as long as I want to. What's the problem? I read file X at time
> T and got data Y. That's all I need.

No problem if all you do is use file data.  A serious problem if
the stuff you read is used to make a decision about where
to write something else on that shared disk.  For example:
The fs needs to extend a file.  It reads the free block bitmap,
and finds a free block.  Then it overwrites that free block,
and also writes back a changed block bitmap.  Unfortunately
some other machine just did the same thing and you
now have a crosslinked and corrupt file.
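
In pseudo-C the race is just this (read_block/write_block/block_for are
stand-ins for the buffer i/o; the point is that nothing spans the
read-modify-write of the bitmap):

        bitmap = read_block(bitmap_blocknr);      /* node A and node B    */
        bit    = find_first_zero_bit(bitmap, sz); /* both find bit N      */
        set_bit(bit, bitmap);                     /* each in its own copy */
        write_block(bitmap_blocknr, bitmap);      /* last writer wins     */
        write_block(block_for(bit), filedata);    /* both write "their"
                                                     block N: crosslink   */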

There are several similar scenarios. You can't really talk
about "not caching".  Once you read something into
memory it is "cached" in memory, even if you only use it once
and then re-read it whenever you need it later.
> 
> > even instantly after you read it.
> 
> So what?
See above. 

> > Now, try "ls -l" twice instead of reading from a file.  Notice
> > that no io happens the second time.  Here we're reading
> 
> Directory data is cached.
> 
> > metadata instead of file data.  This sort of stuff
> > is cached in separate caches that assumes nothing
> > else modifies the disk.
> 
> True, and I'm happy to change it. I don't think we always had a
> directory cache.

> 
> > > > filesystem you end up only having this option on the one(s) that you
> > > > modify.
> > >
> > > I intend to make the generic mechanism attractive.
> >
> > It won't be attractive, for the simple reason that a no-cache fs
> > will be devastatingly slow.  A program that reads a file one byte at
> 
> A generic mechanism is not a "no cache fs". It's a generic mechanism.
> 
> > Nobody will have time to wait for this, and this alone makes your
> 
> Try arguing logically. I really don't like it when people invent their
> > own straw men and then proceed to reason as though it were *mine*.
> 
Maybe I wasn't clear.  What I say is that a fs that doesn't cache
anything in order to avoid cache coherency problems will be
too slow for generic use.  (Such as two desktop computers
sharing a single main disk with applications and data)
Perhaps it is really useful for some special purpose, I haven't
seen you claim what you want this for, so I assumed general use.

There is nothing illogical about performance problems.  A
cacheless system may _work_ and it might be simple, but
it is also _useless_ for a lot of common situations
where cached fs'es work fine.


> > The main reason I can imagine for letting two machines write to
> > the *same* disk is performance.  Going cacheless won't give you
> 
> Then imagine some more. I'm not responsible for your imagination ...

You tell.  You keep asking why your idea won't work and I
give you "performance problems" _even_ if you sort out the
correctness issues with no other cost than the lack of cache.

Please tell what you think it can be used for.  I do not say
it is useless for everything, although it certainly is useless
for the purposes I can come up with.  The only uses *I* find
for a shared writeable (but uncachable) disk are so special that
I wouldn't bother putting a fs like ext2 on it.  Sharing a
raw block device is doable today if you let the programs
using it keep track of data themselves instead of using
a fs.  This isn't what you want though.  It could be interesting
to know what you want, considering what your solution looks like.

Helge Hafting


* Re: [RFC] mount flag "direct"
  2002-09-04  5:57             ` Helge Hafting
@ 2002-09-04  6:21               ` Peter T. Breuer
  2002-09-04  6:49                 ` Helge Hafting
  0 siblings, 1 reply; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-04  6:21 UTC (permalink / raw)
  To: Helge Hafting; +Cc: ptb, linux kernel

"A month of sundays ago Helge Hafting wrote:"
> "Peter T. Breuer" wrote:
> > "A month of sundays ago David Lang wrote:"
> > > Peter, the thing that you seem to be missing is that direct mode only
> > > works for writes, it doesn't force a filesystem to go to the hardware for
> > > reads.
> > 
> > Yes it does. I've checked! Well, at least I've checked that writing
> > then reading causes the reads to get to the device driver. I haven't
> > checked what reading twice does.
> 
> You tried reading from a file?  For how long are you going to

Yes I did. And I tried reading twice too, and it reads twice at device
level.

> work on that data you read?  The other machine may ruin it anytime,

Well, as long as I want to. What's the problem? I read file X at time
T and got data Y. That's all I need.

> even instantly after you read it.

So what?

> Now, try "ls -l" twice instead of reading from a file.  Notice
> that no io happens the second time.  Here we're reading

Directory data is cached. 

> metadata instead of file data.  This sort of stuff
> is cached in separate caches that assumes nothing
> else modifies the disk.

True, and I'm happy to change it. I don't think we always had a
directory cache.

> > > filesystem you end up only having this option on the one(s) that you
> > > modify.
> > 
> > I intend to make the generic mechanism attractive.
> 
> It won't be attractive, for the simple reason that a no-cache fs
> will be devastatingly slow.  A program that reads a file one byte at

A generic mechanism is not a "no cache fs". It's a generic mechanism.

> Nobody will have time to wait for this, and this alone makes your

Try arguing logically. I really don't like it when people invent their
own straw men and then proceed to reason as though it were *mine*.

> The main reason I can imagine for letting two machines write to
> the *same* disk is performance.  Going cacheless won't give you

Then imagine some more. I'm not responsible for your imagination ...

> that.  But you *can* beat nfs and friends by going for
> a "distributed ext2" or similiar where the participating machines
> talks to each other about who writes where.  
> Each machine locks down the blocks they want to cache, with
> either a shared read lock or a exclusive write lock.

That's already done.

> There is a lot of performance tricks you may use, such as

No tricks. Let's be simple.

Peter


* Re: [RFC] mount flag "direct"
  2002-09-03 17:30           ` Peter T. Breuer
  2002-09-03 17:40             ` David Lang
@ 2002-09-04  5:57             ` Helge Hafting
  2002-09-04  6:21               ` Peter T. Breuer
  1 sibling, 1 reply; 31+ messages in thread
From: Helge Hafting @ 2002-09-04  5:57 UTC (permalink / raw)
  To: ptb; +Cc: linux kernel

"Peter T. Breuer" wrote:
> 
> "A month of sundays ago David Lang wrote:"
> > Peter, the thing that you seem to be missing is that direct mode only
> > works for writes, it doesn't force a filesystem to go to the hardware for
> > reads.
> 
> Yes it does. I've checked! Well, at least I've checked that writing
> then reading causes the reads to get to the device driver. I haven't
> checked what reading twice does.

You tried reading from a file?  For how long are you going to
work on that data you read?  The other machine may ruin it anytime,
even instantly after you read it.

Now, try "ls -l" twice instead of reading from a file.  Notice
that no io happens the second time.  Here we're reading
metadata instead of file data.  This sort of stuff
is cached in separate caches that assumes nothing
else modifies the disk.



> 
> If it doesn't cause the data to be read twice, then it ought to, and
> I'll fix it (given half a clue as extra pay ..:-)
> 
> > for many filesystems you cannot turn off their internal caching of data
> > (metadata for some, all data for others)
> 
> Well, let's take things one at a time. Put in a VFS mechanism and then
> convert some FSs to use it.
> 
> > so to implement what you are after you will have to modify the filesystem
> > to not cache anything, since you aren't going to do this for every
> 
> Yes.
> 
> > filesystem you end up only having this option on the one(s) that you
> > modify.
> 
> I intend to make the generic mechanism attractive.

It won't be attractive, for the simple reason that a no-cache fs
will be devastatingly slow.  A program that reads a file one byte at
a time will do 1024 disk accesses to read a single kilobyte.  And
it will do that again if you run it again.  

Nobody will have time to wait for this, and this alone makes your
idea useless.  To get an idea, try booting with mem=4M and suffer.
A cacheless fs will be much, much worse than that.

Using nfs or similar will be so much faster.  Existing
network fs'es work around the complexities by using one machine as
the disk server; the others simply transfer requests to and from that machine
and let it sort things out alone.

The main reason I can imagine for letting two machines write to
the *same* disk is performance.  Going cacheless won't give you
that.  But you *can* beat nfs and friends by going for
a "distributed ext2" or similiar where the participating machines
talks to each other about who writes where.  
Each machine locks down the blocks they want to cache, with
either a shared read lock or a exclusive write lock.

There are a lot of performance tricks you may use, such as
pre-reserving some free blocks for each machine, some ranges
of inodes and so on, so each can modify those without
asking the others.  Then re-distribute stuff occasionally so
nobody runs out while the others have plenty.
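
Schematically (invented names), such a per-machine reserve is just a private
free list that only needs the cluster-wide lock when it runs dry:

struct node_prealloc {
        unsigned long   free_block[64];     /* blocks handed to this node */
        int             nr_free;
        unsigned long   inode_lo, inode_hi; /* private inode range        */
};

static unsigned long alloc_block(struct node_prealloc *p)
{
        if (p->nr_free == 0)
                refill_from_coordinator(p); /* the only slow, locked path */
        return p->free_block[--p->nr_free];
}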


Helge Hafting


* Re: [RFC] mount flag "direct"
  2002-09-03 18:02           ` Andreas Dilger
@ 2002-09-03 18:44             ` Daniel Phillips
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Phillips @ 2002-09-03 18:44 UTC (permalink / raw)
  To: Andreas Dilger, Rik van Riel
  Cc: Peter T. Breuer, Lars Marowsky-Bree, linux kernel

On Tuesday 03 September 2002 20:02, Andreas Dilger wrote:
> Actually, we are using ext3 pretty much as-is for our backing-store
> for Lustre.  The same is true of InterMezzo, and NFS, for that matter.
> All of them live on top of a standard "local" filesystem, which doesn't
> know the things that happen above it to make it a network filesystem
> (locking, etc).

To put this in simplistic terms, this works because you treat the
underlying filesystem simply as a storage device, a slightly funky
kind of disk.

-- 
Daniel


* Re: [RFC] mount flag "direct"
  2002-09-03 16:41       ` Peter T. Breuer
                           ` (2 preceding siblings ...)
  2002-09-03 17:29         ` Jan Harkes
@ 2002-09-03 18:31         ` Daniel Phillips
  3 siblings, 0 replies; 31+ messages in thread
From: Daniel Phillips @ 2002-09-03 18:31 UTC (permalink / raw)
  To: ptb, Lars Marowsky-Bree; +Cc: Peter T. Breuer, linux kernel

On Tuesday 03 September 2002 18:41, Peter T. Breuer wrote:
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
> 
> Yes, thanks, I know.
> 
> > coherency, journal replay to name a few - which you can hardly solve at 
> > the
> 
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.

Well, for example, you would not be able to have the same file open in two 
different kernels because the inode would be cached.  So you'd have to close 
the root directory on one kernel before the other could access any file.  Not 
only would that be horribly inefficient, you would *still* need to implement
a locking protocol between the two kernels to make it work.

There's no magic way of making this easy.

-- 
Daniel


* Re: [RFC] mount flag "direct"
  2002-09-03 15:44   ` Peter T. Breuer
  2002-09-03 16:23     ` Lars Marowsky-Bree
@ 2002-09-03 18:20     ` Daniel Phillips
  1 sibling, 0 replies; 31+ messages in thread
From: Daniel Phillips @ 2002-09-03 18:20 UTC (permalink / raw)
  To: ptb, Anton Altaparmakov; +Cc: Peter T. Breuer, linux kernel

On Tuesday 03 September 2002 17:44, Peter T. Breuer wrote:
> > > Scenario:
> > > I have a driver which accesses a "disk" at the block level, to which
> > > another driver on another machine is also writing. I want to have
> > > an arbitrary FS on this device which can be read from and written to
> > > from both kernels, and I want support at the block level for this idea.
> > 
> > You cannot have an arbitrary fs. The two fs drivers must coordinate with
> > each other in order for your scheme to work. Just think about if the two 
> > fs drivers work on the same file simultaneously and both start growing the
> > file at the same time. All hell would break lose.
> 
> Thanks!
> 
> Rik also mentioned that objection! That's good. You both "only" see
> the same problem, so there can't be many more like it..

(intentionally misinterpreting)  No indeed, there aren't many problems
like it, in terms of sheer complexity.

-- 
Daniel


* Re: [RFC] mount flag "direct"
  2002-09-03 17:26         ` Rik van Riel
@ 2002-09-03 18:02           ` Andreas Dilger
  2002-09-03 18:44             ` Daniel Phillips
  0 siblings, 1 reply; 31+ messages in thread
From: Andreas Dilger @ 2002-09-03 18:02 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Peter T. Breuer, Lars Marowsky-Bree, linux kernel

On Sep 03, 2002  14:26 -0300, Rik van Riel wrote:
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > > Your approach is not feasible.
> >
> > But you have to be specific about why not. I've responded to the
> > particular objections so far.
> 
> You make it sound like you bet your masters degree on
> doing a distributed filesystem without filesystem support ;)

Actually, we are using ext3 pretty much as-is for our backing store
for Lustre.  The same is true of InterMezzo, and NFS, for that matter.
All of them live on top of a standard "local" filesystem, which doesn't
know about the things that happen above it to make it a network
filesystem (locking, etc.).

That isn't to say that I agree with just taking a local filesystem and
putting it on a shared block device and expecting it to work with only
the normal filesystem code.  We do all of our locking above the fs
level, but we do have some help in the VFS (intent-based lookup, patch
in the Lustre CVS repository, if people are interested).
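
For readers unfamiliar with the idea, here is a rough illustration of
what "intent-based lookup" means; this is not the actual Lustre patch,
just an invented sketch of the concept: pass the reason for a lookup
down with the name, so the filesystem can take the right distributed
lock (or even perform the create/open) in a single round trip:

struct inode;
struct dentry;

enum lookup_intent_op {
	IT_LOOKUP,	/* plain name resolution        */
	IT_GETATTR,	/* stat() is coming             */
	IT_OPEN,	/* open() is coming             */
	IT_CREAT,	/* open(O_CREAT) / creat()      */
	IT_UNLINK,	/* unlink()/rmdir() is coming   */
};

struct lookup_intent {
	enum lookup_intent_op it_op;
	int   it_mode;		/* create mode bits, if any      */
	int   it_flags;		/* open flags, if any            */
	void *it_lock_data;	/* fs-private distributed lock   */
};

/* a lookup hook could then take the intent alongside the name: */
struct dentry *fs_lookup_it(struct inode *dir, struct dentry *dentry,
			    struct lookup_intent *it);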

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 17:30           ` Peter T. Breuer
@ 2002-09-03 17:40             ` David Lang
  2002-09-04  5:57             ` Helge Hafting
  1 sibling, 0 replies; 31+ messages in thread
From: David Lang @ 2002-09-03 17:40 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> "A month of sundays ago David Lang wrote:"
> > Peter, the thing that you seem to be missing is that direct mode only
> > works for writes; it doesn't force a filesystem to go to the hardware for
> > reads.
>
> Yes it does. I've checked! Well, at least I've checked that writing
> then reading causes the reads to get to the device driver. I haven't
> checked what reading twice does.
>
> If it doesn't cause the data to be read twice, then it ought to, and
> I'll fix it (given half a clue as extra pay ..:-)

writing then reading the same file may cause it to be read from the disk,
but reading /foo/bar then reading /foo/bar again will not cause two reads
of all data.

some filesystems go to a lot of work to organize their metadata in
memory in particular ways to access things more efficiently; you will
have to go into each filesystem and modify it to not do this.

in addition you will have lots of potential races as one system reads a
block of data, modifies it, then writes it back while the other system
does the same thing. you cannot easily detect this in the low level
drivers, as these are separate calls from the filesystem, and even if
you do, what error message will you send to the second system? there's
no error that says 'the disk has changed under you, back up and re-read
it before you modify it'
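
the only general fix is to hold a cluster-wide lock across the whole
read-modify-write. a hedged userspace sketch (error handling omitted;
lock_block_range()/unlock_block_range() are hypothetical helpers
standing in for whatever block-level locking ends up existing):

#include <unistd.h>
#include <sys/types.h>

/* hypothetical cluster-wide block-range lock helpers */
extern void lock_block_range(int fd, off_t blkno, off_t count);
extern void unlock_block_range(int fd, off_t blkno, off_t count);

/* flip one bit in a shared 4K bitmap block; without the lock held
 * across the read-modify-write, two nodes doing this at once silently
 * lose one of the updates and no error is ever reported */
static void set_bitmap_bit(int fd, off_t blkno, int bit)
{
	char buf[4096];

	lock_block_range(fd, blkno, 1);
	pread(fd, buf, sizeof(buf), blkno * 4096);
	buf[bit / 8] |= 1 << (bit % 8);
	pwrite(fd, buf, sizeof(buf), blkno * 4096);
	unlock_block_range(fd, blkno, 1);
}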

yes this is stuff that could be added to all filesystems, but will the
filesystem maintainers let you do this major surgery to their systems?

for example the XFS and JFS teams are going to a lot of effort to keep
their systems compatible with other OSes; they probably won't
appreciate all the extra conditionals that you will need to put in to
do all of this.

even for ext2 there are people (including Linus, I believe) saying that
major new features should not be added to ext2 but to a new filesystem
forked off of ext2 (ext3, for example, or a fork of it).

David Lang

> > for many filesystems you cannot turn off their internal caching of data
> > (metadata for some, all data for others)
>
> Well, let's take things one at a time. Put in a VFS mechanism and then
> convert some FSs to use it.
>
> > so to implement what you are after you will have to modify the filesystem
> > to not cache anything, since you aren't going to do this for every
>
> Yes.
>
> > filesystem, you end up only having this option on the one(s) that you
> > modify.
>
> I intend to make the generic mechanism attractive.
>
> > if you have a single (or even just a few) filesystems that have this
> > option you may as well include the locking/syncing software in them rather
> > than modifying the VFS layer.
>
> Why? Are you advocating a particular approach? Yes, I agree that that
> is a possible way to go - but I will want the extra VFS ops anyway,
> and will want to modify the particular fs to use them, no?
>
> Peter
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 17:07         ` David Lang
@ 2002-09-03 17:30           ` Peter T. Breuer
  2002-09-03 17:40             ` David Lang
  2002-09-04  5:57             ` Helge Hafting
  0 siblings, 2 replies; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-03 17:30 UTC (permalink / raw)
  To: David Lang; +Cc: linux kernel

"A month of sundays ago David Lang wrote:"
> Peter, the thing that you seem to be missing is that direct mode only
> works for writes; it doesn't force a filesystem to go to the hardware for
> reads.

Yes it does. I've checked! Well, at least I've checked that writing
then reading causes the reads to get to the device driver. I haven't
checked what reading twice does.

If it doesn't cause the data to be read twice, then it ought to, and
I'll fix it (given half a clue as extra pay ..:-)

> for many filesystems you cannot turn off their internal caching of data
> (metadata for some, all data for others)

Well, let's take things one at a time. Put in a VFS mechanism and then
convert some FSs to use it.

> so to implement what you are after you will have to modify the filesystem
> to not cache anything, since you aren't going to do this for every

Yes.

> filesystem, you end up only having this option on the one(s) that you
> modify.

I intend to make the generic mechanism attractive.

> if you have a single (or even just a few) filesystems that have this
> option you may as well include the locking/syncing software in them rather
> than modifying the VFS layer.

Why? Are you advocating a particular approach? Yes, I agree that that
is a possible way to go - but I will want the extra VFS ops anyway, 
and will want to modify the particular fs to use them, no?

Peter

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 16:41       ` Peter T. Breuer
  2002-09-03 17:07         ` David Lang
  2002-09-03 17:26         ` Rik van Riel
@ 2002-09-03 17:29         ` Jan Harkes
  2002-09-03 18:31         ` Daniel Phillips
  3 siblings, 0 replies; 31+ messages in thread
From: Jan Harkes @ 2002-09-03 17:29 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux-kernel

On Tue, Sep 03, 2002 at 06:41:49PM +0200, Peter T. Breuer wrote:
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > Your approach is not feasible.
> 
> But you have to be specific about why not. I've responded to the
> particular objections so far.
> 
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
> 
> Yes, thanks, I know.
> 
> > coherency, journal replay to name a few - which you can hardly solve at the
> 
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.

That is a very simple suggestion, but not feasible because there will
always be 'cached copies' floating around. Even if you remove the dcache
(directory lookups) and icache (inode cache) in the kernel, both
filesystems will still need to look at the data in order to modify it.
Looking at the data involves creating an in-memory representation of the
object. If there is no locking and one filesystem modifies the object,
the other filesystem is looking at (and possibly modifying) stale data,
which causes consistency problems.

> > Good reading would be any sort of entry literature on clustering, I would
> 
> Please don't condescend! I am honestly not in need of education :-).

I'm afraid that all of this has been very well documented; another
example would be Tanenbaum's "Distributed Systems", whose chapter on
the various consistency models is a nice read.

> We already know that we can have a perfectly fine and arbitrary
> shared file system, shared only at the block level if we 
> 
>   1) permit no new dirs or files to be made (disable O_CREAT or something
>      like)
>   2) do all r/w on files with O_DIRECT
>   3) do file extensions via a new generic VFS "reserve" operation
>   4) have shared mutexes on all vfs op, implemented by passing
>      down a special "tag" request to the block layer.
>   5) maintain read+write order at the shared resource.

Can I quote your 'puhleese' here? Inodes share the same on-disk blocks,
so when one inode is changed (setattr, truncate) and written back to
disk, the write affects all the other inodes stored in the same block.
So the shared mutexes at the VFS level don't cover the necessary locking.

Each time you add another point to work around the latest argument,
someone will surely give you another argument, until you end up with a
system that is no longer practical. And it will then probably be even
slower, because you absolutely cannot allow the FS to trust _any_ data
without a locked read or write off the disk (or across the network). And
because you seem to like CPU consistency that much, this even involves
the data that happens to be 'cached' in the CPU.

> I have already implemented 2,4,5.
> 
> The question is how to extend the range of useful operations. For the
> moment I would be happy simply to go ahead and implement 1) and 3),
> while taking serious strong advice on what to do about directories.

Perhaps the fact that directories (and journalled filesystems) aren't
already solved is an indication that the proposed 'solution' is flawed?

Filesystems were designed to trust the disk as 'stable storage', i.e.
anything that was read or recently written will still be the same. NFS
already weakens this model slightly. AFS and Coda go even further: we
only guarantee that changes are propagated when a file is closed. There
is a callback mechanism to invalidate cached copies. But even when we
open a file, it could still have been changed within the past 1/2 RTT.
This is a window we intentionally live with because it avoids the full
RTT hit we would take if we had to go to the server on every file open.

It is the latency that kills you when you can't cache.
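
Purely illustrative arithmetic (numbers invented for the example): with
nothing cached at all, opening /usr/local/bin/foo means synchronously
reading an inode and a directory block for /, /usr, /usr/local and
/usr/local/bin, plus the inode of foo itself; call it nine reads. At
roughly 8 ms per uncached disk access that is around 70 ms per open,
versus microseconds when the dcache and icache are warm, and that is
before any actual file data moves.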

Jan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 16:41       ` Peter T. Breuer
  2002-09-03 17:07         ` David Lang
@ 2002-09-03 17:26         ` Rik van Riel
  2002-09-03 18:02           ` Andreas Dilger
  2002-09-03 17:29         ` Jan Harkes
  2002-09-03 18:31         ` Daniel Phillips
  3 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2002-09-03 17:26 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Lars Marowsky-Bree, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> > Your approach is not feasible.
>
> But you have to be specific about why not. I've responded to the
> particular objections so far.

[snip]

> Please don't condescend! I am honestly not in need of education :-).

You make it sound like you bet your masters degree on
doing a distributed filesystem without filesystem support ;)

Rik
-- 
	http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid.  Go buy yourself a real t-shirt"

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 16:41       ` Peter T. Breuer
@ 2002-09-03 17:07         ` David Lang
  2002-09-03 17:30           ` Peter T. Breuer
  2002-09-03 17:26         ` Rik van Riel
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: David Lang @ 2002-09-03 17:07 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Lars Marowsky-Bree, linux kernel

Peter, the thing that you seem to be missing is that direct mode only
works for writes; it doesn't force a filesystem to go to the hardware for
reads.

for many filesystems you cannot turn off their internal caching of data
(metadata for some, all data for others)

so to implement what you are after you will have to modify the filesystem
to not cache anything, since you aren't going to do this for every
filesystem, you end up only having this option on the one(s) that you
modify.

if you have a single (or even just a few) filesystems that have this
option you may as well include the locking/syncing software in them rather
than modifying the VFS layer.

David Lang


 On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> Date: Tue, 3 Sep 2002 18:41:49 +0200 (MET DST)
> From: Peter T. Breuer <ptb@it.uc3m.es>
> To: Lars Marowsky-Bree <lmb@suse.de>
> Cc: Peter T. Breuer <ptb@it.uc3m.es>,
>      linux kernel <linux-kernel@vger.kernel.org>
> Subject: Re: [RFC] mount flag "direct"
>
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> >    "Peter T. Breuer" <ptb@it.uc3m.es> said:
> >
> > > No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> > > support necessary :-).
> > >
> > > That's really what my question is driving at. I see that I need to
> > > make VFS ops communicate "tag requests" to the block layer, in
> > > order to implement locking. Now you and Rik have pointed out one
> > > operation that needs locking. My next question is obviously: can you
> > > point me more or less precisely at this operation in the VFS layer?
> > > I've only started studying it and I am relatively unfamiliar with it.
> >
> > Your approach is not feasible.
>
> But you have to be specific about why not. I've responded to the
> particular objections so far.
>
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
>
> Yes, thanks, I know.
>
> > coherency, journal replay to name a few - which you can hardly solve at the
>
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.
>
> As to journal replay, there will be no journalling - if it breaks it
> breaks and somebody (fsck) can go fix it. I don't want to get anywhere
> near complicated.
>
> > VFS layer.
> >
> > Good reading would be any sort of entry literature on clustering, I would
>
> Please don't condescend! I am honestly not in need of education :-).
>
> > recommend "In search of clusters" and many of the whitepapers Google will turn
> > up for you, as well as the OpenGFS source.
>
> (Puhleeese!)
>
> We already know that we can have a perfectly fine and arbitrary
> shared file system, shared only at the block level if we
>
>   1) permit no new dirs or files to be made (disable O_CREAT or something
>      like)
>   2) do all r/w on files with O_DIRECT
>   3) do file extensions via a new generic VFS "reserve" operation
>   4) have shared mutexes on all vfs op, implemented by passing
>      down a special "tag" request to the block layer.
>   5) maintain read+write order at the shared resource.
>
> I have already implemented 2,4,5.
>
> The question is how to extend the range of useful operations. For the
> moment I would be happy simply to go ahead and implement 1) and 3),
> while taking serious strong advice on what to do about directories.
>
>
>
> Peter
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 16:23     ` Lars Marowsky-Bree
@ 2002-09-03 16:41       ` Peter T. Breuer
  2002-09-03 17:07         ` David Lang
                           ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-03 16:41 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
>    "Peter T. Breuer" <ptb@it.uc3m.es> said:
> 
> > No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> > support necessary :-).
> > 
> > That's really what my question is driving at. I see that I need to
> > make VFS ops communicate "tag requests" to the block layer, in
> > order to implement locking. Now you and Rik have pointed out one
> > operation that needs locking. My next question is obviously: can you
> > point me more or less precisely at this operation in the VFS layer?
> > I've only started studying it and I am relatively unfamiliar with it.
> 
> Your approach is not feasible.

But you have to be specific about why not. I've responded to the
particular objections so far.

> Distributed filesystems have a lot of subtle pitfalls - locking, cache

Yes, thanks, I know.

> coherency, journal replay to name a few - which you can hardly solve at the

My simple suggestion is not to cache. I am of the opinion that in
principle that solves all coherency problems, since there would be no
stored state that needs to "cohere". The question is how to identify
and remove the state that is currently cached.

As to journal replay, there will be no journalling - if it breaks it
breaks and somebody (fsck) can go fix it. I don't want to get anywhere
near complicated.

> VFS layer.
> 
> Good reading would be any sort of entry literature on clustering, I would

Please don't condescend! I am honestly not in need of education :-).

> recommend "In search of clusters" and many of the whitepapers Google will turn
> up for you, as well as the OpenGFS source.

(Puhleeese!)

We already know that we can have a perfectly fine and arbitrary
shared file system, shared only at the block level if we 

  1) permit no new dirs or files to be made (disable O_CREAT or something
     like)
  2) do all r/w on files with O_DIRECT
  3) do file extensions via a new generic VFS "reserve" operation
  4) have shared mutexes on all vfs op, implemented by passing
     down a special "tag" request to the block layer.
  5) maintain read+write order at the shared resource.

I have already implemented 2,4,5.

The question is how to extend the range of useful operations. For the
moment I would be happy simply to go ahead and implement 1) and 3),
while taking serious strong advice on what to do about directories.
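
For concreteness, here is a purely hypothetical sketch of what the
"tag" of point 4) might carry; every name is invented and this is not a
proposal for the actual 2.5 block API, just the shape of the idea. The
VFS would submit one ahead of an operation's reads and writes and a
matching release afterwards, and the shared device serialises holders
across the participating kernels:

struct blk_tag_request {
	unsigned long long start_sector;	/* zone to lock              */
	unsigned long long nr_sectors;
	int  mode;				/* 0 = shared, 1 = exclusive */
	int  node_id;				/* which kernel is asking    */
	unsigned long long op_id;		/* pairs acquire and release */
};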



Peter

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 15:44   ` Peter T. Breuer
@ 2002-09-03 16:23     ` Lars Marowsky-Bree
  2002-09-03 16:41       ` Peter T. Breuer
  2002-09-03 18:20     ` Daniel Phillips
  1 sibling, 1 reply; 31+ messages in thread
From: Lars Marowsky-Bree @ 2002-09-03 16:23 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux kernel

On 2002-09-03T17:44:10,
   "Peter T. Breuer" <ptb@it.uc3m.es> said:

> No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> support necessary :-).
> 
> That's really what my question is driving at. I see that I need to
> make VFS ops communicate "tag requests" to the block layer, in
> order to implement locking. Now you and Rik have pointed out one
> operation that needs locking. My next question is obviously: can you
> point me more or less precisely at this operation in the VFS layer?
> I've only started studying it and I am relatively unfamiliar with it.

Your approach is not feasible.

Distributed filesystems have a lot of subtle pitfalls - locking, cache
coherency, journal replay to name a few - which you can hardly solve at the
VFS layer.

Good reading would be any sort of entry literature on clustering, I would
recommend "In search of clusters" and many of the whitepapers Google will turn
up for you, as well as the OpenGFS source.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 16:04     ` Peter T. Breuer
@ 2002-09-03 16:08       ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2002-09-03 16:08 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Maciej W. Rozycki, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> "A month of sundays ago Maciej W. Rozycki wrote:"
> > On Tue, 3 Sep 2002, Rik van Riel wrote:
> > > And what if they both allocate the same disk block to another
> > > file, simultaneously ?
> >
> >  You need a mutex then.  For SCSI devices a reservation is the way to go
> > -- the RESERVE/RELEASE commands are mandatory for direct-access devices,
> > so they should work universally for disks.
>
> Is there provision in VFS for this operation?

No. Everybody but you seems to agree these things should be
filesystem specific and not in the VFS.

> (i.e. care to point me at an entry point? I just grepped for "reserve"
> and came up with nothing useful).

Good.

cheers,

Rik
-- 
	http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid.  Go buy yourself a real t-shirt"

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 15:53   ` Maciej W. Rozycki
@ 2002-09-03 16:04     ` Peter T. Breuer
  2002-09-03 16:08       ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-03 16:04 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Rik van Riel, Peter T. Breuer, linux kernel

"A month of sundays ago Maciej W. Rozycki wrote:"
> On Tue, 3 Sep 2002, Rik van Riel wrote:
> > And what if they both allocate the same disk block to another
> > file, simultaneously ?
> 
>  You need a mutex then.  For SCSI devices a reservation is the way to go
> -- the RESERVE/RELEASE commands are mandatory for direct-access devices,
> so they should work universally for disks.

Is there provision in VFS for this operation?

(i.e. care to point me at an entry point? I just grepped for "reserve"
and came up with nothing useful).

Peter

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 15:13 ` Rik van Riel
@ 2002-09-03 15:53   ` Maciej W. Rozycki
  2002-09-03 16:04     ` Peter T. Breuer
  0 siblings, 1 reply; 31+ messages in thread
From: Maciej W. Rozycki @ 2002-09-03 15:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Peter T. Breuer, linux kernel

On Tue, 3 Sep 2002, Rik van Riel wrote:

> > Rationale:
> > No caching means that each kernel doesn't go off with its own idea of
> > what is on the disk in a file, at least. Dunno about directories and
> > metadata.
> 
> And what if they both allocate the same disk block to another
> file, simultaneously ?

 You need a mutex then.  For SCSI devices a reservation is the way to go
-- the RESERVE/RELEASE commands are mandatory for direct-access devices,
so they should work universally for disks.
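
As a rough illustration (not from any existing tool; error handling
omitted, the fd would come from opening the corresponding /dev/sg*
node, and note that plain SCSI-2 reservations are dropped on a bus or
device reset), issuing those commands from userspace through the SG_IO
ioctl could look like this:

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* send a 6-byte CDB with no data phase; 0x16 = RESERVE, 0x17 = RELEASE */
static int simple_scsi_cmd(int fd, unsigned char opcode)
{
	unsigned char cdb[6] = { opcode, 0, 0, 0, 0, 0 };
	unsigned char sense[32];
	struct sg_io_hdr io;

	memset(&io, 0, sizeof(io));
	io.interface_id    = 'S';
	io.cmd_len         = sizeof(cdb);
	io.cmdp            = cdb;
	io.dxfer_direction = SG_DXFER_NONE;
	io.sbp             = sense;
	io.mx_sb_len       = sizeof(sense);
	io.timeout         = 5000;		/* milliseconds */

	return ioctl(fd, SG_IO, &io);
}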

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 15:37 ` Anton Altaparmakov
@ 2002-09-03 15:44   ` Peter T. Breuer
  2002-09-03 16:23     ` Lars Marowsky-Bree
  2002-09-03 18:20     ` Daniel Phillips
  0 siblings, 2 replies; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-03 15:44 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Anton Altaparmakov wrote:"
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> 
> > I'll rephrase this as an RFC, since I want help and comments.
> > 
> > Scenario:
> > I have a driver which accesses a "disk" at the block level, to which
> > another driver on another machine is also writing. I want to have
> > an arbitrary FS on this device which can be read from and written to
> > from both kernels, and I want support at the block level for this idea.
> 
> You cannot have an arbitrary fs. The two fs drivers must coordinate with
> each other in order for your scheme to work. Just think about if the two 
> fs drivers work on the same file simultaneously and both start growing the
> file at the same time. All hell would break loose.

Thanks!

Rik also mentioned that objection! That's good. You both "only" see
the same problem, so there can't be many more like it..

I replied thusly:

  OK - reply:
  It appears that in order to allocate away free space, one must first
  "grab" that free space using a shared lock. That's perfectly
  feasible.


> For your scheme to work, the fs drivers need to communicate with each
> other in order to attain atomicity of cluster and inode (de-)allocations,
> etc.

Yes. They must create atomic FS operations at the VFS level (grabbing
unallocated space is one of them) and I must share the locks for those
ops.

> Basically you need a clustered fs for this to work. GFS springs to

No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
support necessary :-).

That's really what my question is driving at. I see that I need to
make VFS ops communicate "tag requests" to the block layer, in
order to implement locking. Now you and Rik have pointed out one
operation that needs locking. My next question is obviously: can you
point me more or less precisely at this operation in the VFS layer?
I've only started studying it and I am relatively unfamiliar with it.

Thanks.

Peter

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 15:01 [RFC] mount flag "direct" Peter T. Breuer
  2002-09-03 15:13 ` Rik van Riel
  2002-09-03 15:16 ` jbradford
@ 2002-09-03 15:37 ` Anton Altaparmakov
  2002-09-03 15:44   ` Peter T. Breuer
  2 siblings, 1 reply; 31+ messages in thread
From: Anton Altaparmakov @ 2002-09-03 15:37 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> I'll rephrase this as an RFC, since I want help and comments.
> 
> Scenario:
> I have a driver which accesses a "disk" at the block level, to which
> another driver on another machine is also writing. I want to have
> an arbitrary FS on this device which can be read from and written to
> from both kernels, and I want support at the block level for this idea.

You cannot have an arbitrary fs. The two fs drivers must coordinate with
each other in order for your scheme to work. Just think about if the two 
fs drivers work on the same file simultaneously and both start growing the
file at the same time. All hell would break loose.

For your scheme to work, the fs drivers need to communicate with each
other in order to attain atomicity of cluster and inode (de-)allocations,
etc.

Basically you need a clustered fs for this to work. GFS springs to
mind but I never really looked at it...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 15:01 [RFC] mount flag "direct" Peter T. Breuer
  2002-09-03 15:13 ` Rik van Riel
@ 2002-09-03 15:16 ` jbradford
  2002-09-03 15:37 ` Anton Altaparmakov
  2 siblings, 0 replies; 31+ messages in thread
From: jbradford @ 2002-09-03 15:16 UTC (permalink / raw)
  To: ptb; +Cc: linux-kernel

> Rationale:
> No caching means that each kernel doesn't go off with its own idea of
> what is on the disk in a file, at least. Dunno about directories and
> metadata.

Somewhat related to this: is there currently, or would it be possible
to include in what you're working on now, a sane way for two or more
machines to access a SCSI drive on a shared SCSI bus?  In other words,
several host adaptors in different machines are all connected to one
SCSI bus, and can all access a single hard disk.  At the moment you can
only do this if all machines mount the disk read-only.

John.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 15:01 [RFC] mount flag "direct" Peter T. Breuer
@ 2002-09-03 15:13 ` Rik van Riel
  2002-09-03 15:53   ` Maciej W. Rozycki
  2002-09-03 15:16 ` jbradford
  2002-09-03 15:37 ` Anton Altaparmakov
  2 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2002-09-03 15:13 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> Rationale:
> No caching means that each kernel doesn't go off with its own idea of
> what is on the disk in a file, at least. Dunno about directories and
> metadata.

And what if they both allocate the same disk block to another
file, simultaneously ?

A mount option isn't enough to achieve your goal.

It looks like you want GFS or OCFS. Info about GFS can be found at:

	http://www.opengfs.org/
	http://www.sistina.com/  (commercial GFS)

Dunno where Oracle's cluster fs is documented.

regards,

Rik
-- 
	http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid.  Go buy yourself a real t-shirt"

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC] mount flag "direct"
@ 2002-09-03 15:01 Peter T. Breuer
  2002-09-03 15:13 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Peter T. Breuer @ 2002-09-03 15:01 UTC (permalink / raw)
  To: linux kernel

I'll rephrase this as an RFC, since I want help and comments.

Scenario:
I have a driver which accesses a "disk" at the block level, to which
another driver on another machine is also writing. I want to have
an arbitrary FS on this device which can be read from and written to
from both kernels, and I want support at the block level for this idea.

Question:
What do people think of adding a "direct" option to mount, with the
semantics that the VFS then makes all opens on files on the FS mounted
"direct" use O_DIRECT, which means that file r/w is not cached in VMS,
but instead goes straight to and from the real device? Is this enough
or nearly enough for what I have in mind?

Rationale:
No caching means that each kernel doesn't go off with its own idea of
what is on the disk in a file, at least. Dunno about directories and
metadata.

Wish:
If that mount option looks promising, can somebody make provision for
it in the kernel? Details to be ironed out later? 

What I have explored or will explore:
1) I have put shared zoned read/write locks on the remote resource, so each
kernel request locks precisely the "disk" area that it should, in
precisely the mode it should, for precisely the duration of each block
layer request (a sketch of such a zone table follows point 4 below).

2) I have maintained request write order from individual kernels.

3) IMO I should also intercept and share the FS superblock lock, but that's
for later, and please tell me about it. What about dentries? Does
O_DIRECT get rid of them? What happens with mkdir?

4) I would LIKE the kernel to emit a "tag request" on the underlying
device before and after every atomic FS operation, so that I can maintain
FS atomicity at the block level. Please comment. Can somebody make this 
happen, please? Or do I add the functionality to VFS myself? Where?
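
Here is the promised sketch for point 1): a toy zone table, with every
name invented (it is not the real driver code). A request may share a
zone with other readers, but any overlap involving a writer is a
conflict that has to wait:

#include <stdint.h>

enum zone_mode { ZONE_READ, ZONE_WRITE };

struct zone_lock {
	uint64_t          start;	/* first sector of the zone  */
	uint64_t          len;		/* length in sectors         */
	enum zone_mode    mode;
	int               owner;	/* node id holding the lock  */
	struct zone_lock *next;		/* per-device singly linked  */
};

/* true if a new request [start, start+len) in 'mode' must wait for 'a' */
static int zones_conflict(const struct zone_lock *a, uint64_t start,
			  uint64_t len, enum zone_mode mode)
{
	int overlap = start < a->start + a->len && a->start < start + len;

	return overlap && (mode == ZONE_WRITE || a->mode == ZONE_WRITE);
}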

I have patched the kernel to support mount -o direct, creating MS_DIRECT
and MNT_DIRECT flags for the purpose.  And it works.  But I haven't
dared do too much to the remote FS by way of testing yet. I have
confirmed that individual file contents can be changed without problem
when the file size does not change.

Comments?

Here is the tiny proof of concept patch for VFS that implements the
"direct" mount option.


Peter

The idea embodied in this patch is that if we get the MS_DIRECT flag when
the vfs do_mount() is called, we pass it across into the mnt flags used
by do_add_mount() as MNT_DIRECT and thus make it a permanent part of the
vfsmnt object that is the mounted fs.  Then, in the generic
dentry_open() call for any file, we examine the flags on the mnt
parameter and set the O_DIRECT flag on the file pointer if MNT_DIRECT
is set on the vfsmnt object.

That makes all file opens O_DIRECT on the file system in question,
and makes all file accesses uncached by the VM.

The patch in itself works fine.

--- linux-2.5.31/fs/open.c.pre-o_direct	Mon Sep  2 20:36:11 2002
+++ linux-2.5.31/fs/open.c	Mon Sep  2 17:12:08 2002
@@ -643,6 +643,9 @@
 		if (error)
 			goto cleanup_file;
 	}
+	if (mnt->mnt_flags & MNT_DIRECT)
+		f->f_flags |= O_DIRECT;
+
 	f->f_ra.ra_pages = inode->i_mapping->backing_dev_info->ra_pages;
 	f->f_dentry = dentry;
 	f->f_vfsmnt = mnt;
--- linux-2.5.31/fs/namespace.c.pre-o_direct	Mon Sep  2 20:37:39 2002
+++ linux-2.5.31/fs/namespace.c	Mon Sep  2 17:12:04 2002
@@ -201,6 +201,7 @@
 		{ MS_MANDLOCK, ",mand" },
 		{ MS_NOATIME, ",noatime" },
 		{ MS_NODIRATIME, ",nodiratime" },
+		{ MS_DIRECT, ",direct" },
 		{ 0, NULL }
 	};
 	static struct proc_fs_info mnt_info[] = {
@@ -734,7 +741,9 @@
 		mnt_flags |= MNT_NODEV;
 	if (flags & MS_NOEXEC)
 		mnt_flags |= MNT_NOEXEC;
-	flags &= ~(MS_NOSUID|MS_NOEXEC|MS_NODEV);
+	if (flags & MS_DIRECT)
+		mnt_flags |= MNT_DIRECT;
+	flags &= ~(MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_DIRECT);
 
 	/* ... and get the mountpoint */
 	retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
--- linux-2.5.31/include/linux/mount.h.pre-o_direct	Mon Sep  2 20:31:16 2002
+++ linux-2.5.31/include/linux/mount.h	Mon Sep  2 18:06:14 2002
@@ -17,6 +17,7 @@
 #define MNT_NOSUID	1
 #define MNT_NODEV	2
 #define MNT_NOEXEC	4
+#define MNT_DIRECT	256
 
 struct vfsmount
 {
--- linux-2.5.31/include/linux/fs.h.pre-o_direct	Mon Sep  2 20:32:05 2002
+++ linux-2.5.31/include/linux/fs.h	Mon Sep  2 18:05:57 2002
@@ -104,6 +104,9 @@
 #define MS_REMOUNT	32	/* Alter flags of a mounted FS */
 #define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
 #define MS_DIRSYNC	128	/* Directory modifications are synchronous */
+
+#define MS_DIRECT	256     /* Make all opens be O_DIRECT */
+
 #define MS_NOATIME	1024	/* Do not update access times. */
 #define MS_NODIRATIME	2048	/* Do not update directory access times */
 #define MS_BIND		4096
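
A hedged usage sketch (device, mount point and fs type are just
placeholders): since the stock mount(8) knows nothing about the new
flag, the quickest way to exercise the patch is to call mount(2)
directly with the MS_DIRECT value it defines:

#include <stdio.h>
#include <sys/mount.h>

#ifndef MS_DIRECT
#define MS_DIRECT 256	/* matches the patched include/linux/fs.h */
#endif

int main(void)
{
	if (mount("/dev/nbd0", "/mnt/shared", "ext2", MS_DIRECT, NULL)) {
		perror("mount");
		return 1;
	}
	return 0;
}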


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2002-09-08 16:41 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20020907164631.GA17696@marowsky-bree.de>
2002-09-07 19:59 ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Peter T. Breuer
2002-09-07 20:27   ` Rik van Riel
2002-09-07 21:14   ` [RFC] mount flag "direct" Lars Marowsky-Bree
2002-09-08  9:23     ` Peter T. Breuer
2002-09-08  9:59       ` Lars Marowsky-Bree
2002-09-08 16:46         ` Peter T. Breuer
2002-09-07 23:18   ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Andreas Dilger
2002-09-03 15:01 [RFC] mount flag "direct" Peter T. Breuer
2002-09-03 15:13 ` Rik van Riel
2002-09-03 15:53   ` Maciej W. Rozycki
2002-09-03 16:04     ` Peter T. Breuer
2002-09-03 16:08       ` Rik van Riel
2002-09-03 15:16 ` jbradford
2002-09-03 15:37 ` Anton Altaparmakov
2002-09-03 15:44   ` Peter T. Breuer
2002-09-03 16:23     ` Lars Marowsky-Bree
2002-09-03 16:41       ` Peter T. Breuer
2002-09-03 17:07         ` David Lang
2002-09-03 17:30           ` Peter T. Breuer
2002-09-03 17:40             ` David Lang
2002-09-04  5:57             ` Helge Hafting
2002-09-04  6:21               ` Peter T. Breuer
2002-09-04  6:49                 ` Helge Hafting
2002-09-04  9:15                   ` Peter T. Breuer
2002-09-04 11:34                     ` Helge Hafting
2002-09-03 17:26         ` Rik van Riel
2002-09-03 18:02           ` Andreas Dilger
2002-09-03 18:44             ` Daniel Phillips
2002-09-03 17:29         ` Jan Harkes
2002-09-03 18:31         ` Daniel Phillips
2002-09-03 18:20     ` Daniel Phillips
