linux-kernel.vger.kernel.org archive mirror
* Re: 64-bit kdev_t - just for playing
@ 2003-03-27 20:27 Andries.Brouwer
  2003-03-27 22:12 ` Roman Zippel
  0 siblings, 1 reply; 80+ messages in thread
From: Andries.Brouwer @ 2003-03-27 20:27 UTC (permalink / raw)
  To: Andries.Brouwer, zippel; +Cc: linux-kernel

> It would really help, if you would explain how a larger dev_t
> will work during 2.6.

> How is backward compatibility done, so that I can still boot a 2.4 kernel?

Old device numbers remain valid, so all changes are completely
transparent.

> How have user space utilities to be changed, which know about
> dev_t (e.g. ls or fdisk)?

If you do not use mknod to create device nodes with large device numbers,
then no new utilities are needed. If you really want to use large
device numbers, you need a new glibc; some utilities will require
recompilation because of the use of sysmacros.h.
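
A minimal sketch of what "use of sysmacros.h" means in practice, assuming a glibc system. A utility that takes a dev_t apart with major() and minor() from <sys/sysmacros.h> follows whatever encoding the installed libc uses once it is recompiled; one that hard-codes a shift by 8 does not. The helper name format_dev is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>	/* major(), minor(), makedev() (glibc) */

/*
 * Hypothetical helper: write "major, minor" for a device number into buf.
 * It never looks at the bit layout of dev_t itself, so a recompile
 * against a newer libc is all it needs when the layout changes.
 * Returns what snprintf returns (the length of the formatted text).
 */
int format_dev(dev_t dev, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%u, %u", major(dev), minor(dev));
}
```

Called with makedev(8, 1), the helper fills buf with "8, 1".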

Andries

* Re: 64-bit kdev_t - just for playing
@ 2003-04-09 18:40 James Bottomley
  2003-04-09 20:54 ` Roman Zippel
  0 siblings, 1 reply; 80+ messages in thread
From: James Bottomley @ 2003-04-09 18:40 UTC (permalink / raw)
  To: Roman Zippel; +Cc: linux-kernel

> Let's try it with a real example:
> I have two onboard SCSI channels, the first one is for external devices
> and the second for internal devices.

There seems to be some confusion here.  As I understand it, you're
advocating a completely dynamic device space (both major and minor), but
the concrete examples you post come from devices that do dynamic minors
only.

Thanks to the work of Al Viro and others, the problem of dynamic majors
for block devices now lies predominantly in user space, but those
problems are still significant.

As far as SCSI goes, in the current 8/8 device scheme, we occupy 16
different majors for sd already, but this only gives us room for 256
discs and we had to compromise and only have 16 partitions on each (as
opposed to the 64 that IDE has).  Even if you expand this (as there are
patches to do), we get into trouble because we only have 256 sg nodes,
etc.

Expanding the size of the kernel dev_t will allow us to occupy only one
major for each SCSI device type, supply far more discs and still move
to a 64 partition model.
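
The arithmetic behind those figures can be sketched as follows. This is an illustration only: max_disks is a hypothetical helper, and the 20-bit minor used in the second case is an assumption borrowed from the 12+20 dev_t split in the kdev_t.h posted at the start of this thread, not a decided layout.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many whole disks fit when `majors` major
 * numbers are available, each with 2^minor_bits minors, and every disk
 * reserves `minors_per_disk` minors (the whole disk plus its partitions).
 */
unsigned long long max_disks(unsigned majors, unsigned minor_bits,
			     unsigned minors_per_disk)
{
	assert(minors_per_disk > 0 && minor_bits < 32);
	return (unsigned long long)majors * (1ULL << minor_bits)
		/ minors_per_disk;
}
```

With the current 8/8 scheme, max_disks(16, 8, 16) gives 256 discs. With one major, a 20-bit minor and 64 minors per disc, max_disks(1, 20, 64) gives 16384.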

> I have to come back to the two questions I already asked earlier:
> 1. How do we want to manage devices in the future?

Well, it's a legitimate question to ask, but not one anyone is required
to answer.  The whole "taste" thing in the kernel is about making
correct decisions without necessarily seeing the ultimate end points. 
Enabling rather than dictating.  Nothing about an expanded kernel dev_t
precludes more dynamism in major number allocation.

However, there is already consideration of this issue, see for example:

http://www.linuxsymposium.org/2003/view_abstract.php?talk=94

If you have other contributions to make, I'm sure people will listen.

> 2. What compromises can we make for 2.6?

I think that is to expand the kernel's device type but keep the current
scheme of static majors with static or dynamic minors.  It seems to me,
at this late stage in the game, that this will cause the minimum
disruption and require the minimum of code changes, while still allowing
us to satisfy the enterprise device demands.  Pragmatically as well, we
already have the patches for this; we don't for dynamic majors.

James



* Re: 64-bit kdev_t - just for playing
@ 2003-04-09 18:36 Andries.Brouwer
  2003-04-09 21:11 ` Roman Zippel
  0 siblings, 1 reply; 80+ messages in thread
From: Andries.Brouwer @ 2003-04-09 18:36 UTC (permalink / raw)
  To: hpa, zippel; +Cc: Andries.Brouwer, linux-kernel

> Peter and Andries, I would really appreciate it, if
> you would stop ignoring me

But Roman, I answered a dozen of your letters, and
it seems to me that further letters would only
lead to repetition.

Let me recapitulate.

(i) On dev_t
A dev_t is a number. There are three streams:
mknod sends such a number from user space to the file system,
stat sends such a number from the file system to user space,
open and mount send such a number from the file system to the kernel
where the kernel finds an associated device.

Your questions are about the meaning of this number,
that is, about the third part. What I am doing is only
removing certain restrictions on the size of the number.
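
The second stream, from the file system to user space, can be sketched as below on a Linux system with glibc. The helper name device_number_of is made up for illustration; stat(2), the S_ISCHR/S_ISBLK macros and the st_rdev field are the standard interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>	/* major(), minor() for the caller */

/*
 * Hypothetical helper: read the device number stored in a device node.
 * Returns 0 and fills *out on success, -1 if path cannot be stat'ed
 * or is not a character or block device.
 */
int device_number_of(const char *path, dev_t *out)
{
	struct stat st;

	assert(path != NULL && out != NULL);
	if (stat(path, &st) != 0)
		return -1;
	if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
		return -1;
	*out = st.st_rdev;	/* the number mknod stored in the inode */
	return 0;
}
```

The number that comes back is exactly what mknod put into the node; its size is the only thing under discussion here.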


(ii) On device naming.
One can handle device naming in the old-fashioned way,
as Unix always did, or one can invent one of many possible
schemes for the future.
This old-fashioned way has a myriad of problems, but it works,
people are used to it and know precisely what the problems are,
and our present software handles it.
These schemes of the future have not yet crystallized very well:
there are several to choose from, we do not yet know very well
what the problems will be, and most of the software to handle such
new schemes still has to be written.

[And note that stat is an important system call - even with
new naming schemes we may need numbers - possibly some hash of
whatever ID we have found.]

Clearly, there will be a long transition period where these
schemes will coexist.

Now this old fashioned scheme has run into some limits -
it ran into them already several years ago, witness the
introduction of several scsi disk majors. Removing these limits
is not very difficult, so we do that now.

Your letters carry the tone of "it is forbidden to work on the
old scheme before you have shown how to solve all device naming
problems". But I am not going to.

You have opinions and questions about future schemes.
And so do I. But since time is limited I wrote you
already a handful of times: "Later".

This number stuff is simple and straightforward, we know precisely
what has to be done, but of course it needs to be done.

Naming on the other hand is intricate, lots of complications.
Device naming - but what is a device? Already that is complicated.
These are good discussions, and maybe sysfs will provide the answer
in certain cases, but these discussions are independent of dev_t.

Andries

* Re: 64-bit kdev_t - just for playing
@ 2003-04-01 18:32 Andries.Brouwer
  0 siblings, 0 replies; 80+ messages in thread
From: Andries.Brouwer @ 2003-04-01 18:32 UTC (permalink / raw)
  To: alan, hch
  Cc: Andries.Brouwer, Joel.Becker, Wim.Coekaerts, ahu, greg,
	linux-kernel, zippel

> Why do we need a split at all?

Inside the kernel? No need at all.
But the Unix API is in terms of major,minor.

The NFS specification talks about major,minor.
The ISO 9660 (RockRidge) standard talks about major,minor.
Etc.

Inside the kernel we can do whatever we want, and no split
is required. In userspace such a split definitely exists.

See also mknod(1) and ls(1).

Andries

* Re: 64-bit kdev_t - just for playing
@ 2003-03-31 23:41 Badari Pulavarty
  2003-03-31 23:54 ` William Lee Irwin III
                   ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Badari Pulavarty @ 2003-03-31 23:41 UTC (permalink / raw)
  To: Joel.Becker; +Cc: linux-kernel


> I'm right here campaigning loudly for a larger dev_t. I intend 
> to never, ever make assumptions about dev_t. In fact, I'd rather not 
> deal with dev_t. But I do need a way to map 4k or 8k or 16k disks. 
> now. 
>
> Joel

 Hi Joel,

I have been playing with supporting 4000 disks on IA32 machines.
There are a bunch of issues we need to resolve before we can
do that.

I am using scsi_debug to simulate 4000 disks. (Of course, I had
to hack "sd" to support more than 256 disks.) Anyway, I noticed
that I lost almost 350MB of my lowmem when I simulated 4000 disks.
We are working on most of these. But there are userlevel issues
to be resolved. Here is the list ...

1) deadline_drq, blkdev_request consume 80 MB of low memory.
      - Jens is looking at it. He is working on a patch to allocate
requests dynamically.

2) sysfs inodes use up 50 MB of low memory
        - 4000 disks without partitions create (4000 * 35) = 140,000 inodes in
/sysfs.  So, it uses 50 MB of lowmem.

3) dcache is eating up 25 MB of low memory.

4) kmalloc() slabs are consuming 55 MB. We are in the process
of identifying the heavy consumers and fixing them.
 	- Jens is fixing hash table size issues with io-schedulers.
	- I have a patch to allocate "hd_struct" dynamically. So, if your disks
          do not have any partitions, you don't use any more memory.

5) glibc and utility issues - lots of stuff is broken
    (Need a new libc)
        - mknod
        - ls
        - raw command
        - etc..
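
As a rough tally of items 1 to 4 above (a sketch using only the figures quoted in this mail; the helper names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Low-memory figures from items 1 to 4 of the list, in MB. */
static const unsigned lowmem_mb[] = { 80, 50, 25, 55 };

/* Sum of the per-item figures, in MB. */
unsigned accounted_lowmem_mb(void)
{
	unsigned total = 0;

	for (size_t i = 0; i < sizeof lowmem_mb / sizeof lowmem_mb[0]; i++)
		total += lowmem_mb[i];
	return total;
}

/* sysfs inodes created for `disks` disks at `inodes_per_disk` each. */
unsigned long sysfs_inodes(unsigned long disks, unsigned long inodes_per_disk)
{
	assert(inodes_per_disk > 0);
	return disks * inodes_per_disk;
}
```

The four items add up to 210 MB, so roughly 140 MB of the "almost 350MB" is not broken down in the list.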

I have not done any IO on these yet. When I mount all of these and do
IO on them, we might see new issues. So with all these, I am doubtful
whether we can ever reach 16k disks on IA32.

Thanks,
Badari



* Re: 64-bit kdev_t - just for playing
@ 2003-03-28 15:33 Andries.Brouwer
  2003-03-28 15:49 ` Roman Zippel
  0 siblings, 1 reply; 80+ messages in thread
From: Andries.Brouwer @ 2003-03-28 15:33 UTC (permalink / raw)
  To: Andries.Brouwer, zippel; +Cc: alan, greg, linux-kernel

    On Fri, 28 Mar 2003 Andries.Brouwer@cwi.nl wrote:

    > > the actual size of this number is only a small detail
    > 
    > Yes. It is that detail I am concerned with.

    If you don't want to or can't answer my question, it means I can revert 
    your character device changes?

I am not Linus. You can send him whatever you think best
and see whether he applies it.

I would prefer if you waited a bit. This little detail,
changing the size of dev_t, requires an audit of the
kernel source. That takes some time.
I would much prefer postponing discussion about device
handling until after number handling is in good shape.

Generally it is a bad idea when two people simultaneously
change the same code.

Andries



* Re: 64-bit kdev_t - just for playing
@ 2003-03-28 11:46 Andries.Brouwer
  2003-03-28 11:57 ` Roman Zippel
  0 siblings, 1 reply; 80+ messages in thread
From: Andries.Brouwer @ 2003-03-28 11:46 UTC (permalink / raw)
  To: Andries.Brouwer, zippel; +Cc: alan, greg, linux-kernel

> the actual size of this number is only a small detail

Yes. It is that detail I am concerned with.

* Re: 64-bit kdev_t - just for playing
@ 2003-03-28 11:10 Andries.Brouwer
  2003-03-28 11:36 ` Roman Zippel
  0 siblings, 1 reply; 80+ messages in thread
From: Andries.Brouwer @ 2003-03-28 11:10 UTC (permalink / raw)
  To: greg, zippel; +Cc: Andries.Brouwer, alan, linux-kernel

Roman, your questions are misguided.
A larger dev_t is infrastructure.
A sand road that is turned into an asphalt road.

Nobody has to use this improved infrastructure.
But many uses are conceivable.

Long ago I reserved 2^40 values for dynamically
assigned anonymous devices. Convenient, a very
small fraction of the available space.

I can imagine that there will be people wanting
to take part of the available space for a universal
hash of disk serial number or partition label or
I don't know what, so that devices are addressable
by content instead of path.
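
A toy sketch of that last idea, assuming a 64-bit number space: hash an identifying string (a serial number or a label) to a number. The function name label_to_number is hypothetical, and FNV-1a is used only because it is short; nothing in this thread proposes a particular hash, and a real scheme would also have to deal with collisions and reserved ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 64-bit FNV-1a over a NUL-terminated string.
 * Illustration only: not a proposal for how device numbers
 * should actually be derived.
 */
uint64_t label_to_number(const char *label)
{
	uint64_t h = 14695981039346656037ULL;	/* FNV offset basis */

	assert(label != NULL);
	for (const unsigned char *p = (const unsigned char *)label; *p; p++) {
		h ^= *p;			/* mix in one byte */
		h *= 1099511628211ULL;		/* FNV prime, wraps mod 2^64 */
	}
	return h;
}
```

The same label always yields the same number, whatever path the device was found on.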

A lot of room can be put to many uses.

Andries

* Re: 64-bit kdev_t - just for playing
@ 2003-03-27 22:37 Andries.Brouwer
  2003-03-27 22:55 ` Roman Zippel
  0 siblings, 1 reply; 80+ messages in thread
From: Andries.Brouwer @ 2003-03-27 22:37 UTC (permalink / raw)
  To: Andries.Brouwer, zippel; +Cc: linux-kernel

> Can I have now more than 15 partitions?

Two years ago I amused myself creating a few hundred partitions
on a SCSI disk. Yes, no doubt the availability of numbers will
remove the current limits on the number of partitions of a disk.

But, as I answered you several times already,
right now my topic is dev_t, not devices or partitions.
Just the number.

Andries


* 64-bit kdev_t - just for playing
@ 2003-03-27  1:09 Andries.Brouwer
  2003-03-27 19:23 ` Roman Zippel
  2003-03-30 20:10 ` H. Peter Anvin
  0 siblings, 2 replies; 80+ messages in thread
From: Andries.Brouwer @ 2003-03-27  1:09 UTC (permalink / raw)
  To: Joel.Becker; +Cc: linux-kernel

>> Maybe I should send another patch tonight, just for playing.

> Please, I'd like that.

Below is a random version of kdev_t.h.
(The file is smaller than the patch.)

kdev_t is the kernel-internal representation
dev_t is the kernel idea of the user space representation
(of course glibc uses 64 bits, split up as 8+8 :-)

kdev_t can be equal to dev_t.

The file below completely randomly makes kdev_t
64 bits, split up 32+32, and dev_t 32 bits, split up 12+20.

Andries

------------------------------------------------------------
#ifndef _LINUX_KDEV_T_H
#define _LINUX_KDEV_T_H
#ifdef __KERNEL__

typedef struct {
	unsigned long long value;
} kdev_t;

#define KDEV_MINOR_BITS		32
#define KDEV_MAJOR_BITS		32
#define KDEV_MINOR_MASK		((1ULL << KDEV_MINOR_BITS) - 1)

#define __mkdev(major, minor)	(((unsigned long long)(major) << KDEV_MINOR_BITS) + (minor))

#define mk_kdev(major, minor)	((kdev_t) { __mkdev(major,minor) } )

/*
 * The "values" are just _cookies_, usable for 
 * internal equality comparisons and for things
 * like NFS filehandle conversion.
 */
static inline unsigned long long kdev_val(kdev_t dev)
{
	return dev.value;
}

static inline kdev_t val_to_kdev(unsigned long long val)
{
	kdev_t dev;
	dev.value = val;
	return dev;
}

#define HASHDEV(dev)	(kdev_val(dev))
#define NODEV		(mk_kdev(0,0))

extern const char * kdevname(kdev_t);	/* note: returns pointer to static data! */

static inline int kdev_same(kdev_t dev1, kdev_t dev2)
{
	return dev1.value == dev2.value;
}

#define kdev_none(d1)	(!kdev_val(d1))

#define minor(dev)	((unsigned int)((dev).value & KDEV_MINOR_MASK))
#define major(dev)	((unsigned int)((dev).value >> KDEV_MINOR_BITS))

/* These are for user-level "dev_t" */
#define MINORBITS	8
#define MINORMASK	((1U << MINORBITS) - 1)
#define DEV_MINOR_BITS	20
#define	DEV_MAJOR_BITS	12
#define	DEV_MINOR_MASK	((1U << DEV_MINOR_BITS) - 1)
#define DEV_MAJOR_MASK	((1U << DEV_MAJOR_BITS) - 1)

#include <linux/types.h>	/* dev_t */

#define MAJOR(dev)	((unsigned int)(((dev) & 0xffff0000) ? ((dev) >> DEV_MINOR_BITS) & DEV_MAJOR_MASK : ((dev) >> 8) & 0xff))
#define MINOR(dev)	((unsigned int)(((dev) & 0xffff0000) ? ((dev) & DEV_MINOR_MASK) : ((dev) & 0xff)))
#define MKDEV(ma,mi)	((dev_t)((((ma) & ~0xff) == 0 && ((mi) & ~0xff) == 0) ? (((ma) << 8) | (mi)) : (((ma) << DEV_MINOR_BITS) | (mi))))

/*
 * Conversion functions
 */

static inline int kdev_t_to_nr(kdev_t dev)
{
	unsigned int ma = major(dev);
	unsigned int mi = minor(dev);
	return MKDEV(ma, mi);
}

static inline kdev_t to_kdev_t(dev_t dev)
{
	unsigned int ma = MAJOR(dev);
	unsigned int mi = MINOR(dev);
	return mk_kdev(ma, mi);
}

#else /* __KERNEL__ */

/*
Some programs want their definitions of MAJOR and MINOR and MKDEV
from the kernel sources. These must be the externally visible ones.
Of course such programs should be updated.
*/
#define MAJOR(dev)	((dev)>>8)
#define MINOR(dev)	((dev) & 0xff)
#define MKDEV(ma,mi)	((ma)<<8 | (mi))
#endif /* __KERNEL__ */
#endif
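
The way the in-kernel MAJOR/MINOR/MKDEV macros above keep old 8+8 numbers valid can be tried out in user space with the sketch below. It restates the same logic as functions; the compat_* names are made up for illustration, and the range assert in compat_mkdev is an addition of this sketch (the macros themselves do no range checking).

```c
#include <assert.h>
#include <stdint.h>

#define DEV_MINOR_BITS	20
#define DEV_MAJOR_BITS	12

/* Build a 32-bit number: old 8+8 layout when both halves fit in 8 bits. */
uint32_t compat_mkdev(uint32_t ma, uint32_t mi)
{
	assert(ma < (1U << DEV_MAJOR_BITS) && mi < (1U << DEV_MINOR_BITS));
	if (ma < 256 && mi < 256)
		return (ma << 8) | mi;
	return (ma << DEV_MINOR_BITS) | mi;
}

/* A number with nothing in its top 16 bits is read as old 8+8. */
uint32_t compat_major(uint32_t dev)
{
	if (dev & 0xffff0000U)
		return (dev >> DEV_MINOR_BITS) & ((1U << DEV_MAJOR_BITS) - 1);
	return (dev >> 8) & 0xff;
}

uint32_t compat_minor(uint32_t dev)
{
	if (dev & 0xffff0000U)
		return dev & ((1U << DEV_MINOR_BITS) - 1);
	return dev & 0xff;
}
```

compat_mkdev(8, 1) is 0x0801, the same value as under the old scheme, while compat_mkdev(8, 256) is 0x800100 and decodes back to 8 and 256. One corner this exposes: major 0 with a minor between 256 and 65535 does not round-trip, since compat_mkdev(0, 300) is 300, which reads back as major 1, minor 44. Whether that matters depends on whether major 0 ever gets minors that large.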


end of thread, other threads:[~2003-04-11  0:59 UTC | newest]

Thread overview: 80+ messages
2003-03-27 20:27 64-bit kdev_t - just for playing Andries.Brouwer
2003-03-27 22:12 ` Roman Zippel
2003-03-27 22:55   ` Alan Cox
2003-03-27 23:19     ` Roman Zippel
2003-03-27 23:48       ` Greg KH
2003-03-28  9:47         ` Roman Zippel
2003-03-28 18:05           ` Joel Becker
2003-03-28 18:48             ` Roman Zippel
2003-03-31  8:31               ` bert hubert
2003-03-31  8:52                 ` Roman Zippel
2003-03-31 17:24                   ` Joel Becker
2003-03-31 21:32                     ` Roman Zippel
2003-03-31 22:18                       ` Alan Cox
2003-03-31 23:42                         ` Roman Zippel
2003-04-01 14:42                           ` Alan Cox
2003-04-01 16:35                             ` Greg KH
2003-04-02 13:02                               ` Roman Zippel
2003-04-01 14:42                           ` Alan Cox
2003-04-01 16:52                             ` Christoph Hellwig
2003-04-01 21:59                             ` H. Peter Anvin
2003-04-02  7:12                               ` Christoph Hellwig
2003-04-02  7:22                                 ` H. Peter Anvin
2003-03-31 23:45                         ` Joel Becker
2003-03-31 23:07                       ` Joel Becker
2003-03-31 23:35                         ` Roman Zippel
  -- strict thread matches above, loose matches on Subject: below --
2003-04-09 18:40 James Bottomley
2003-04-09 20:54 ` Roman Zippel
2003-04-10  2:19   ` James Bottomley
2003-04-10 12:47     ` Roman Zippel
2003-04-10 15:30       ` James Bottomley
2003-04-10 23:53         ` Roman Zippel
2003-04-11  0:01           ` David Lang
2003-04-11  0:17             ` Roman Zippel
2003-04-11  0:47           ` Joel Becker
2003-04-11  1:11             ` Roman Zippel
2003-04-09 18:36 Andries.Brouwer
2003-04-09 21:11 ` Roman Zippel
2003-04-01 18:32 Andries.Brouwer
2003-03-31 23:41 Badari Pulavarty
2003-03-31 23:54 ` William Lee Irwin III
2003-03-31 23:55 ` Joel Becker
2003-04-02 12:18 ` Roman Zippel
2003-04-02 17:31   ` Badari Pulavarty
2003-04-02 22:03     ` H. Peter Anvin
2003-04-03 10:09       ` David Lang
2003-04-03 11:14         ` Lars Marowsky-Bree
2003-04-03 12:13     ` Roman Zippel
2003-04-03 13:37       ` Andries Brouwer
2003-04-03 14:01         ` Roman Zippel
2003-04-07 15:02           ` H. Peter Anvin
2003-04-07 20:10             ` Roman Zippel
2003-04-07 21:57               ` Roman Zippel
2003-04-07 22:43                 ` Kevin P. Fleming
2003-04-08 15:22                   ` Roman Zippel
2003-04-08 22:53                 ` Werner Almesberger
2003-04-08 23:11                   ` David Lang
2003-04-08 23:47                     ` Werner Almesberger
2003-04-08 23:58                       ` Kevin P. Fleming
2003-04-08 23:56                     ` H. Peter Anvin
2003-04-08 23:06                       ` Andrew Morton
2003-04-09  0:40                       ` Roman Zippel
2003-04-09  1:02                         ` Joel Becker
2003-04-09  1:25                           ` Roman Zippel
2003-04-09 16:42                       ` Roman Zippel
2003-04-09  0:21                   ` Roman Zippel
2003-04-11  9:58               ` Pavel Machek
2003-04-08 15:29             ` Roman Zippel
2003-03-28 15:33 Andries.Brouwer
2003-03-28 15:49 ` Roman Zippel
2003-03-28 11:46 Andries.Brouwer
2003-03-28 11:57 ` Roman Zippel
2003-03-28 11:10 Andries.Brouwer
2003-03-28 11:36 ` Roman Zippel
2003-03-30 20:05   ` H. Peter Anvin
2003-03-30 20:13     ` Roman Zippel
2003-03-27 22:37 Andries.Brouwer
2003-03-27 22:55 ` Roman Zippel
2003-03-27  1:09 Andries.Brouwer
2003-03-27 19:23 ` Roman Zippel
2003-03-30 20:10 ` H. Peter Anvin
