linux-kernel.vger.kernel.org archive mirror
* [patch for playing] 2.5.65 patch to support > 256 disks
@ 2003-03-21 18:56 Badari Pulavarty
  2003-03-22 11:00 ` Douglas Gilbert
  0 siblings, 1 reply; 22+ messages in thread
From: Badari Pulavarty @ 2003-03-21 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1655 bytes --]

Hi,

Andries Brouwer recently submitted 32-bit dev_t patches,
which are in 2.5.65-mm2. This patch applies on top of those patches
to support more than 256 disks.  This is for playing only.

I tested this with 4000 disks using scsi_debug. I attached my actual
disks (50) after 4000 scsi_debug disks. I am able to access my disks
fine and do IO on them.

Problems (so far):

1) sd.c - the sd_index_bits[] arrays became big - this needs to be fixed.

2) 4000 disks eat up a lot of low memory (~460 MB). Here is the
/proc/meminfo output before and after insmod.

Before:
MemTotal:      3883276 kB
MemFree:       3808028 kB
Buffers:          3240 kB
Cached:          41860 kB
SwapCached:          0 kB
Active:          45360 kB
Inactive:         7288 kB
HighTotal:     3014616 kB
HighFree:      2961856 kB
LowTotal:       868660 kB
LowFree:        846172 kB
SwapTotal:     2040244 kB
SwapFree:      2040244 kB
Dirty:             192 kB
Writeback:           0 kB
Mapped:          14916 kB
Slab:             7164 kB
Committed_AS:    12952 kB
PageTables:        312 kB
ReverseMaps:      1895
====
After:
MemTotal:      3883276 kB
MemFree:       3224140 kB
Buffers:          3880 kB
Cached:         140376 kB
SwapCached:          0 kB
Active:          47512 kB
Inactive:       105508 kB
HighTotal:     3014616 kB
HighFree:      2838144 kB
LowTotal:       868660 kB
LowFree:        385996 kB
SwapTotal:     2040244 kB
SwapFree:      2040244 kB
Dirty:              92 kB
Writeback:           0 kB
Mapped:          16172 kB
Slab:           464364 kB
Committed_AS:    14996 kB
PageTables:        412 kB
ReverseMaps:      2209

[-- Attachment #2: sd.patch --]
[-- Type: text/x-diff, Size: 1353 bytes --]

--- linux/drivers/scsi/sd.c	Thu Mar 20 15:06:00 2003
+++ linux.new/drivers/scsi/sd.c	Fri Mar 21 11:50:54 2003
@@ -56,7 +56,9 @@
  * Remaining dev_t-handling stuff
  */
 #define SD_MAJORS	16
-#define SD_DISKS	(SD_MAJORS << 4)
+#define SD_DISKS_PER_MAJOR_SHIFT	(KDEV_MINOR_BITS - 4)
+#define SD_DISKS_PER_MAJOR	(1 << SD_DISKS_PER_MAJOR_SHIFT)
+#define SD_DISKS	(SD_MAJORS << SD_DISKS_PER_MAJOR_SHIFT)
 
 /*
  * Time out in seconds for disks and Magneto-opticals (which are slower).
@@ -1328,17 +1330,23 @@ static int sd_attach(struct scsi_device 
 	sdkp->index = index;
 
 	gd->de = sdp->de;
-	gd->major = sd_major(index >> 4);
-	gd->first_minor = (index & 15) << 4;
+	gd->major = sd_major(index >> SD_DISKS_PER_MAJOR_SHIFT);
+	gd->first_minor = (index & (SD_DISKS_PER_MAJOR - 1)) << 4;
 	gd->minors = 16;
 	gd->fops = &sd_fops;
 
-	if (index >= 26) {
+	if (index < 26) {
+		sprintf(gd->disk_name, "sd%c", 'a' + index % 26);
+	} else if (index < (26*27)) {
 		sprintf(gd->disk_name, "sd%c%c",
 			'a' + index/26-1,'a' + index % 26);
 	} else {
-		sprintf(gd->disk_name, "sd%c", 'a' + index % 26);
-	}
+		const unsigned int m1 = (index/ 26 - 1) / 26 - 1;
+		const unsigned int m2 = (index / 26 - 1) % 26;
+		const unsigned int m3 = index % 26;
+		sprintf(gd->disk_name, "sd%c%c%c", 
+			'a' + m1, 'a' + m2, 'a' + m3);
+	}	
 
 	sd_init_onedisk(sdkp, gd);
 

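For reference, a small user-space sketch of the naming logic in
sd_attach() above (illustration only, not part of the patch; the
sd_name() helper is made up). It prints sda, sdz, sdaa, sdzz, sdaaa
and sdewv for the sample indices:

#include <stdio.h>

/* mirrors the index -> disk_name mapping in sd_attach() above */
static void sd_name(unsigned int index, char *buf)
{
	if (index < 26) {
		sprintf(buf, "sd%c", 'a' + index % 26);
	} else if (index < (26*27)) {
		sprintf(buf, "sd%c%c",
			'a' + index/26 - 1, 'a' + index % 26);
	} else {
		const unsigned int m1 = (index / 26 - 1) / 26 - 1;
		const unsigned int m2 = (index / 26 - 1) % 26;
		const unsigned int m3 = index % 26;
		sprintf(buf, "sd%c%c%c", 'a' + m1, 'a' + m2, 'a' + m3);
	}
}

int main(void)
{
	const unsigned int idx[] = { 0, 25, 26, 701, 702, 3999 };
	char name[8];
	unsigned int i;

	for (i = 0; i < sizeof(idx)/sizeof(idx[0]); i++) {
		sd_name(idx[i], name);
		printf("%4u -> %s\n", idx[i], name);
	}
	return 0;
}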

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-21 18:56 [patch for playing] 2.5.65 patch to support > 256 disks Badari Pulavarty
@ 2003-03-22 11:00 ` Douglas Gilbert
  2003-03-22 11:04   ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Douglas Gilbert @ 2003-03-22 11:00 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-kernel, linux-scsi, Andrew Morton

Badari Pulavarty wrote:
> Hi,
> 
> Andries Brouwer recently submitted 32-bit dev_t patches,
> which are in 2.5.65-mm2. This patch applies on top of those patches
> to support more than 256 disks.  This is for playing only.
> 
> I tested this with 4000 disks using scsi_debug. I attached my actual
> disks (50) after 4000 scsi_debug disks. I am able to access my disks
> fine and do IO on them.
> 
> Problems (so far):
> 
> 1) sd.c - the sd_index_bits[] arrays became big - this needs to be fixed.
> 
> 2) 4000 disks eat up a lot of low memory (~460 MB). Here is the
> /proc/meminfo output before and after insmod.
> 
> Before:
> MemTotal:      3883276 kB
> MemFree:       3808028 kB
> Buffers:          3240 kB
> Cached:          41860 kB
> SwapCached:          0 kB
> Active:          45360 kB
> Inactive:         7288 kB
> HighTotal:     3014616 kB
> HighFree:      2961856 kB
> LowTotal:       868660 kB
> LowFree:        846172 kB
> SwapTotal:     2040244 kB
> SwapFree:      2040244 kB
> Dirty:             192 kB
> Writeback:           0 kB
> Mapped:          14916 kB
> Slab:             7164 kB
> Committed_AS:    12952 kB
> PageTables:        312 kB
> ReverseMaps:      1895
> ====
> After:
> MemTotal:      3883276 kB
> MemFree:       3224140 kB
> Buffers:          3880 kB
> Cached:         140376 kB
> SwapCached:          0 kB
> Active:          47512 kB
> Inactive:       105508 kB
> HighTotal:     3014616 kB
> HighFree:      2838144 kB
> LowTotal:       868660 kB
> LowFree:        385996 kB
> SwapTotal:     2040244 kB
> SwapFree:      2040244 kB
> Dirty:              92 kB
> Writeback:           0 kB
> Mapped:          16172 kB
> Slab:           464364 kB
> Committed_AS:    14996 kB
> PageTables:        412 kB
> ReverseMaps:      2209

Badari,
I poked around looking for data on the size issue.
Here are the byte sizes for the per-device and per-host
structures in scsi_debug and the scsi mid-level for
i386, non-SMP, in lk 2.5.65:
   sizeof(sdebug_dev_info)=60, sizeof(scsi_device)=376
     sizeof(sdebug_host_info)=24, sizeof(Scsi_Host)=224

So for 4000 disks they should be responsible for about
2 MB.
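
A quick back-of-the-envelope check of that figure (a sketch only,
assuming Badari's add_host=4 num_devs=1000 setup and the sizes quoted
above):

#include <stdio.h>

int main(void)
{
	/* byte sizes quoted above for i386, non-SMP, lk 2.5.65 */
	const unsigned long per_dev  = 60 + 376;  /* sdebug_dev_info + scsi_device */
	const unsigned long per_host = 24 + 224;  /* sdebug_host_info + Scsi_Host */
	const unsigned long bytes = 4000 * per_dev + 4 * per_host;

	/* prints 1744992 bytes (~1704 KB), i.e. roughly the 2 MB above */
	printf("%lu bytes (~%lu KB)\n", bytes, bytes >> 10);
	return 0;
}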

The scsi_cmd_cache slab info went from this before those
84 pseudo disks were added:
# cat slabinfo_pre.txt
scsi_cmd_cache         3     11    356    1    1    1 :   32   16 :     22     301     6    5    0    0   43 :   3235     31   3264      0
to this afterwards:
# cat slabinfo_post.txt
scsi_cmd_cache        44     55    356    5    5    1 :   32   16 :     66     398    12    7    0    0   43 :   5837     40   5833      0

I did notice a rather large growth of nodes
in sysfs. For 84 added scsi_debug pseudo disks the number
of sysfs nodes went from 686 to 3347.

Does anybody know what is the per node memory cost of sysfs?

Doug Gilbert





* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-22 11:00 ` Douglas Gilbert
@ 2003-03-22 11:04   ` Andrew Morton
  2003-03-22 11:46     ` Douglas Gilbert
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2003-03-22 11:04 UTC (permalink / raw)
  To: dougg; +Cc: pbadari, linux-kernel, linux-scsi

Douglas Gilbert <dougg@torque.net> wrote:
>
> > Slab:           464364 kB

It's all in slab.

> I did notice a rather large growth of nodes
> in sysfs. For 84 added scsi_debug pseudo disks the number
> of sysfs nodes went from 686 to 3347.
> 
> Does anybody know what is the per node memory cost of sysfs?

Let's see all of /proc/slabinfo please.



* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-22 11:04   ` Andrew Morton
@ 2003-03-22 11:46     ` Douglas Gilbert
  2003-03-22 12:05       ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Douglas Gilbert @ 2003-03-22 11:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: pbadari, linux-kernel, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 532 bytes --]

Andrew Morton wrote:
> Douglas Gilbert <dougg@torque.net> wrote:
> 
>>>Slab:           464364 kB
>>
> 
> It's all in slab.
> 
> 
>>I did notice a rather large growth of nodes
>>in sysfs. For 84 added scsi_debug pseudo disks the number
>>of sysfs nodes went from 686 to 3347.
>>
>>Does anybody know what is the per node memory cost of sysfs?
> 
> 
> Let's see all of /proc/slabinfo please.

Andrew,
Attachments are /proc/slabinfo pre and post:
  $ modprobe scsi_debug add_host=42 num_devs=2
which adds 84 pseudo disks.

Doug Gilbert


[-- Attachment #2: slabinfo_pre.txt --]
[-- Type: text/plain, Size: 14921 bytes --]

slabinfo - version: 1.2 (statistics)
hpsb_packet            0      0    132    0    0    1 :   32   16 :     16      16     1    1    0    0   61 :     46      1     47      0
rpc_buffers            8      8   2048    4    4    1 :   32   16 :      8       8     4    0    0    0   34 :      4      4      0      0
rpc_tasks              8     19    204    1    1    1 :   32   16 :     16      16     1    0    0    0   51 :      7      1      0      0
rpc_inode_cache        0      0    480    0    0    1 :   32   16 :      0       0     0    0    0    0   40 :      0      0      0      0
unix_sock             10     21    576    3    3    1 :   32   16 :     21      32     3    0    0    0   39 :    593      4    587      0
tcp_tw_bucket          0      0    100    0    0    1 :   32   16 :      0       0     0    0    0    0   71 :      0      0      0      0
tcp_bind_bucket        7    125     28    1    1    1 :   32   16 :     16      16     1    0    0    0  157 :      6      1      0      0
tcp_open_request       0      0     68    0    0    1 :   32   16 :      0       0     0    0    0    0   88 :      0      0      0      0
inet_peer_cache        0      0     52    0    0    1 :   32   16 :      0       0     0    0    0    0  104 :      0      0      0      0
secpath_cache          0      0     32    0    0    1 :   32   16 :      0       0     0    0    0    0  144 :      0      0      0      0
flow_cache             0      0     60    0    0    1 :   32   16 :      0       0     0    0    0    0   94 :      0      0      0      0
xfrm4_dst_cache        0      0    220    0    0    1 :   32   16 :      0       0     0    0    0    0   50 :      0      0      0      0
ip_fib_hash           10    125     28    1    1    1 :   32   16 :     16      16     1    0    0    0  157 :     10      1      1      0
ip_dst_cache           4     19    208    1    1    1 :   32   16 :     19      48     1    0    0    0   51 :      3      3      2      0
arp_cache              2     21    188    1    1    1 :   32   16 :     17      32     1    0    0    0   53 :      0      2      0      0
raw4_sock              0      0    576    0    0    1 :   32   16 :      7      14     2    2    0    0   39 :      4      2      6      0
udp_sock               7      7    576    1    1    1 :   32   16 :     14      17     2    1    0    0   39 :     37      3     33      0
tcp_sock              16     21   1088    3    3    2 :   32   16 :     21      30     3    0    0    0   39 :     49      6     39      0
scsi_cmd_cache         4     11    356    1    1    1 :   32   16 :     22     141     4    3    0    0   43 :   3155     14   3167      0
nfs_write_data        36     40    452    5    5    1 :   32   16 :     40      40     5    0    0    0   40 :     31      5      0      0
nfs_read_data         32     36    436    4    4    1 :   32   16 :     36      36     4    0    0    0   41 :     28      4      0      0
nfs_inode_cache        0      0    704    0    0    2 :   32   16 :      0       0     0    0    0    0   43 :      0      0      0      0
nfs_page               0      0    108    0    0    1 :   32   16 :      0       0     0    0    0    0   68 :      0      0      0      0
isofs_inode_cache      0      0    440    0    0    1 :   32   16 :      0       0     0    0    0    0   41 :      0      0      0      0
ext2_inode_cache       4      7    576    1    1    1 :   32   16 :      7       7     1    0    0    0   39 :      3      1      0      0
journal_head          26     62     60    1    1    1 :   32   16 :    420    1092     9    2    0    1   94 :   3741     71   3746     55
revoke_table           1    144     24    1    1    1 :   32   16 :     16      16     1    0    0    0  176 :      0      1      0      0
revoke_record          0      0     28    0    0    1 :   32   16 :     16      32     2    2    0    0  157 :      3      2      5      0
ext3_inode_cache    2268   2268    576  324  324    1 :   32   16 :   2268    2279   324    0    0    0   39 :   1967    329     28      0
ext3_xattr             0      0     60    0    0    1 :   32   16 :      0       0     0    0    0    0   94 :      0      0      0      0
eventpoll_pwq          0      0     48    0    0    1 :   32   16 :      0       0     0    0    0    0  109 :      0      0      0      0
eventpoll_epi          0      0     72    0    0    1 :   32   16 :      0       0     0    0    0    0   85 :      0      0      0      0
kioctx                 0      0    268    0    0    1 :   32   16 :      0       0     0    0    0    0   46 :      0      0      0      0
kiocb                  0      0    172    0    0    1 :   32   16 :      0       0     0    0    0    0   55 :      0      0      0      0
dnotify_cache          0      0     32    0    0    1 :   32   16 :      0       0     0    0    0    0  144 :      0      0      0      0
file_lock_cache       17     30    128    1    1    1 :   32   16 :     30     160     1    0    0    0   62 :   1052     12   1047      0
fasync_cache           0      0     28    0    0    1 :   32   16 :      0       0     0    0    0    0  157 :      0      0      0      0
shmem_inode_cache      2      7    576    1    1    1 :   32   16 :      7      13     1    0    0    0   39 :      0      2      0      0
idr_layer_cache        0      0    148    0    0    1 :   32   16 :      0       0     0    0    0    0   58 :      0      0      0      0
posix_timers_cache      0      0    136    0    0    1 :   32   16 :      0       0     0    0    0    0   60 :      0      0      0      0
uid_cache              5    101     36    1    1    1 :   32   16 :     18      32     1    0    0    0  133 :      3      2      0      0
sgpool-MAX_PHYS_SEGMENTS     32     32   2048   16   16    1 :   32   16 :     34      38    18    2    0    0   34 :      0     35      3      0
sgpool-64             32     32   1024    8    8    1 :   32   16 :     36      48    12    4    0    0   36 :      6     36     10      0
sgpool-32             32     32    512    4    4    1 :   32   16 :     40      72     9    5    0    0   40 :     33     37     38      0
sgpool-16             32     42    268    3    3    1 :   32   16 :     42      72     3    0    0    0   46 :    225     36    229      0
sgpool-8              48     54    140    2    2    1 :   32   16 :     54     195     2    0    0    0   59 :   2832     43   2843      0
deadline_drq         768    826     64   14   14    1 :   32   16 :   1035    1302    21    0    0    0   91 :   1191     89    482     30
blkdev_requests      768    775    156   31   31    1 :   32   16 :   1025    1264    47    3    0    1   57 :   1180    100    484     28
biovec-BIO_MAX_PAGES    256    260   3072   52   52    4 :   32   16 :    256     256    52    0    0    0   37 :      0    256      0      0
biovec-128           256    260   1536   52   52    2 :   32   16 :    256     256    52    0    0    0   37 :      0    256      0      0
biovec-64            256    260    768   52   52    1 :   32   16 :    260     268    52    0    0    0   37 :      8    259     11      0
biovec-16            256    266    204   14   14    1 :   32   16 :    266     286    14    0    0    0   51 :    291    259    294      0
biovec-4             256    310     60    5    5    1 :   32   16 :    272     288     5    0    0    0   94 :    481    258    483      0
biovec-1             267    288     24    2    2    1 :   32   16 :    512    1229     6    3    0    0  176 :   5079    317   5093     47
bio                  266    318     72    6    6    1 :   32   16 :    509    1337    13    2    0    1   85 :   5855    329   5874     54
sock_inode_cache      36     40    484    5    5    1 :   32   16 :     40      55     5    0    0    0   40 :    545     10    519      0
skbuff_head_cache     16     24    160    1    1    1 :   32   16 :     24      42     1    0    0    0   56 :     13      3      0      0
sock                   2      9    444    1    1    1 :   32   16 :      9      16     1    0    0    0   41 :     14      3     15      0
proc_inode_cache      82     99    440   11   11    1 :   32   16 :     99     147    11    0    0    0   41 :    572     15    519      1
sigqueue               9     27    144    1    1    1 :   32   16 :     16      16     1    0    0    0   59 :    744      1    745      0
radix_tree_node      805    812    272   58   58    1 :   32   16 :    812     840    58    0    0    0   46 :    743     69     11      0
cdev_cache           143    250     28    2    2    1 :   32   16 :    158     173     2    0    0    0  157 :    131     12      0      0
bdev_cache             4     32    120    1    1    1 :   32   16 :     16      16     1    0    0    0   64 :      6      1      3      0
mnt_cache             19     53     72    1    1    1 :   32   16 :     35      54     1    0    0    0   85 :     24      9     14      0
inode_cache          503    513    424   57   57    1 :   32   16 :    513     546    57    0    0    0   41 :    703    305    505      0
dentry_cache        3854   3857    204  203  203    1 :   32   16 :   3968    4474   216    2    0    1   51 :   5098    699   1909     34
filp                 286    300    156   12   12    1 :   32   16 :    291     307    12    0    0    0   57 :    261     25      0      0
names_cache            1      1   4096    1    1    1 :   32   16 :      8      44    22   21    0    0   33 :  14723     44  14767      0
buffer_head         1506   1534     64   26   26    1 :   32   16 :   1523    2071    26    0    0    1   91 :   3797    139   2404     28
mm_struct             35     35    576    5    5    1 :   32   16 :     35      73     5    0    0    0   39 :   1324      9   1307      0
vm_area_struct       500    500     76   10   10    1 :   32   16 :    527    8798    13    1    0    1   82 :  26991    558  26555    513
fs_cache              40     84     44    1    1    1 :   32   16 :     40      96     2    1    0    0  116 :    736      6    717      0
files_cache           27     27    424    3    3    1 :   32   16 :     27      53     4    1    0    0   41 :    734      8    717      0
signal_cache          65     72     52    1    1    1 :   32   16 :     65     134     1    0    0    0  104 :    743     14    707      0
sighand_cache         45     45   1344   15   15    1 :   32   16 :     48      60    16    1    0    0   35 :    724     23    703      0
task_struct           55     55   1632   11   11    2 :   32   16 :     55      68    11    0    0    0   37 :    736     21    704      0
pte_chain           1650   1650     76   33   33    1 :   32   16 :   1650    6660    42    1    0    1   82 :   9854    453   8357    310
pgd                   27     27   4096   27   27    1 :   32   16 :     33      46    43   16    0    0   33 :   1288     45   1307      0
size-131072(DMA)       0      0 131072    0    0   32 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-131072            0      0 131072    0    0   32 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-65536(DMA)        0      0  65536    0    0   16 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-65536             0      0  65536    0    0   16 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-32768(DMA)        0      0  32768    0    0    8 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-32768             0      0  32768    0    0    8 :    8    4 :      1       1     1    1    0    0    9 :      0      1      1      0
size-16384(DMA)        0      0  16384    0    0    4 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-16384             1      1  16384    1    1    4 :    8    4 :      2       4     4    3    0    0    9 :      2      4      5      0
size-8192(DMA)         0      0   8192    0    0    2 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-8192              6      6   8192    6    6    2 :    8    4 :      7      10     9    3    0    0    9 :     24     10     28      0
size-4096(DMA)         0      0   4096    0    0    1 :   32   16 :      0       0     0    0    0    0   33 :      0      0      0      0
size-4096             27     27   4096   27   27    1 :   32   16 :     34      50    49   22    0    0   33 :    343     50    366      0
size-2048(DMA)         0      0   2048    0    0    1 :   32   16 :      0       0     0    0    0    0   34 :      0      0      0      0
size-2048             10     10   2048    5    5    1 :   32   16 :     10      12     5    0    0    0   34 :      4      7      1      0
size-1024(DMA)         0      0   1024    0    0    1 :   32   16 :      0       0     0    0    0    0   36 :      0      0      0      0
size-1024             40     40   1024   10   10    1 :   32   16 :     40      61    11    1    0    0   36 :    531     37    530      0
size-512(DMA)          0      0    512    0    0    1 :   32   16 :      8       8     1    1    0    0   40 :      0      1      1      0
size-512              62     64    512    8    8    1 :   32   16 :     72      94     9    1    0    0   40 :    426     21    385      0
size-256(DMA)          0      0    268    0    0    1 :   32   16 :     14      14     1    1    0    0   46 :     31      1     32      0
size-256               9     14    268    1    1    1 :   32   16 :     28      60     2    1    0    0   46 :    594      7    592      0
size-192(DMA)          0      0    204    0    0    1 :   32   16 :      0       0     0    0    0    0   51 :      0      0      0      0
size-192             133    133    204    7    7    1 :   32   16 :    133     188     7    0    0    0   51 :    481     65    422      0
size-128(DMA)          0      0    140    0    0    1 :   32   16 :      0       0     0    0    0    0   59 :      0      0      0      0
size-128             554    621    140   23   23    1 :   32   16 :    617     811    23    0    0    0   59 :   1781    287   1514      6
size-64(DMA)           0      0     76    0    0    1 :   32   16 :      0       0     0    0    0    0   82 :      0      0      0      0
size-64              338    400     76    8    8    1 :   32   16 :    366     589     8    0    0    0   82 :    285    213    144     27
size-32(DMA)           0      0     44    0    0    1 :   32   16 :      0       0     0    0    0    0  116 :      0      0      0      0
size-32              232    252     44    3    3    1 :   32   16 :    232     399     3    0    0    0  116 :  14239    174  14042    142
kmem_cache           110    110    180    5    5    1 :   32   16 :    110     110     5    0    0    0   54 :     34     72      0      0

[-- Attachment #3: slabinfo_post.txt --]
[-- Type: text/plain, Size: 14921 bytes --]

slabinfo - version: 1.2 (statistics)
hpsb_packet            0      0    132    0    0    1 :   32   16 :     16      16     1    1    0    0   61 :     46      1     47      0
rpc_buffers            8      8   2048    4    4    1 :   32   16 :      8       8     4    0    0    0   34 :      4      4      0      0
rpc_tasks              8     19    204    1    1    1 :   32   16 :     16      16     1    0    0    0   51 :      7      1      0      0
rpc_inode_cache        0      0    480    0    0    1 :   32   16 :      0       0     0    0    0    0   40 :      0      0      0      0
unix_sock             10     21    576    3    3    1 :   32   16 :     21      32     3    0    0    0   39 :    593      4    587      0
tcp_tw_bucket          0      0    100    0    0    1 :   32   16 :      0       0     0    0    0    0   71 :      0      0      0      0
tcp_bind_bucket        7    125     28    1    1    1 :   32   16 :     16      16     1    0    0    0  157 :      6      1      0      0
tcp_open_request       0      0     68    0    0    1 :   32   16 :      0       0     0    0    0    0   88 :      0      0      0      0
inet_peer_cache        0      0     52    0    0    1 :   32   16 :      0       0     0    0    0    0  104 :      0      0      0      0
secpath_cache          0      0     32    0    0    1 :   32   16 :      0       0     0    0    0    0  144 :      0      0      0      0
flow_cache             0      0     60    0    0    1 :   32   16 :      0       0     0    0    0    0   94 :      0      0      0      0
xfrm4_dst_cache        0      0    220    0    0    1 :   32   16 :      0       0     0    0    0    0   50 :      0      0      0      0
ip_fib_hash           10    125     28    1    1    1 :   32   16 :     16      16     1    0    0    0  157 :     10      1      1      0
ip_dst_cache           5     19    208    1    1    1 :   32   16 :     19      63     1    0    0    0   51 :      3      4      2      0
arp_cache              2     21    188    1    1    1 :   32   16 :     17      32     1    0    0    0   53 :      0      2      0      0
raw4_sock              0      0    576    0    0    1 :   32   16 :      7      14     2    2    0    0   39 :      4      2      6      0
udp_sock               7      7    576    1    1    1 :   32   16 :     14      17     2    1    0    0   39 :     37      3     33      0
tcp_sock              16     21   1088    3    3    2 :   32   16 :     21      30     3    0    0    0   39 :     49      6     39      0
scsi_cmd_cache        66     66    356    6    6    1 :   32   16 :     66     220     9    3    0    0   43 :   5735     22   5713      0
nfs_write_data        36     40    452    5    5    1 :   32   16 :     40      40     5    0    0    0   40 :     31      5      0      0
nfs_read_data         32     36    436    4    4    1 :   32   16 :     36      36     4    0    0    0   41 :     28      4      0      0
nfs_inode_cache        0      0    704    0    0    2 :   32   16 :      0       0     0    0    0    0   43 :      0      0      0      0
nfs_page               0      0    108    0    0    1 :   32   16 :      0       0     0    0    0    0   68 :      0      0      0      0
isofs_inode_cache      0      0    440    0    0    1 :   32   16 :      0       0     0    0    0    0   41 :      0      0      0      0
ext2_inode_cache       4      7    576    1    1    1 :   32   16 :      7       7     1    0    0    0   39 :      3      1      0      0
journal_head          21     62     60    1    1    1 :   32   16 :    420    1124     9    2    0    1   94 :   6048     73   6046     55
revoke_table           1    144     24    1    1    1 :   32   16 :     16      16     1    0    0    0  176 :      0      1      0      0
revoke_record          0      0     28    0    0    1 :   32   16 :     16      32     2    2    0    0  157 :      3      2      5      0
ext3_inode_cache    2275   2275    576  325  325    1 :   32   16 :   2275    2290   325    0    0    0   39 :   1969    331     28      0
ext3_xattr             0      0     60    0    0    1 :   32   16 :      0       0     0    0    0    0   94 :      0      0      0      0
eventpoll_pwq          0      0     48    0    0    1 :   32   16 :      0       0     0    0    0    0  109 :      0      0      0      0
eventpoll_epi          0      0     72    0    0    1 :   32   16 :      0       0     0    0    0    0   85 :      0      0      0      0
kioctx                 0      0    268    0    0    1 :   32   16 :      0       0     0    0    0    0   46 :      0      0      0      0
kiocb                  0      0    172    0    0    1 :   32   16 :      0       0     0    0    0    0   55 :      0      0      0      0
dnotify_cache          0      0     32    0    0    1 :   32   16 :      0       0     0    0    0    0  144 :      0      0      0      0
file_lock_cache       17     30    128    1    1    1 :   32   16 :     30     173     1    0    0    0   62 :   1093     13   1089      0
fasync_cache           0      0     28    0    0    1 :   32   16 :      0       0     0    0    0    0  157 :      0      0      0      0
shmem_inode_cache      2      7    576    1    1    1 :   32   16 :      7      13     1    0    0    0   39 :      0      2      0      0
idr_layer_cache        0      0    148    0    0    1 :   32   16 :      0       0     0    0    0    0   58 :      0      0      0      0
posix_timers_cache      0      0    136    0    0    1 :   32   16 :      0       0     0    0    0    0   60 :      0      0      0      0
uid_cache              5    101     36    1    1    1 :   32   16 :     18      32     1    0    0    0  133 :      3      2      0      0
sgpool-MAX_PHYS_SEGMENTS     32     32   2048   16   16    1 :   32   16 :     34      38    18    2    0    0   34 :      0     35      3      0
sgpool-64             32     32   1024    8    8    1 :   32   16 :     36      48    12    4    0    0   36 :      6     36     10      0
sgpool-32             32     32    512    4    4    1 :   32   16 :     40      72     9    5    0    0   40 :     33     37     38      0
sgpool-16             32     42    268    3    3    1 :   32   16 :     42      82     3    0    0    0   46 :    228     37    233      0
sgpool-8              50     54    140    2    2    1 :   32   16 :     54     227     2    0    0    0   59 :   4322     45   4335      0
deadline_drq       22274  22361     64  379  379    1 :   32   16 :  22538   32880   507    0    0    1   91 :  31317   2219  10606    658
blkdev_requests    22273  22300    156  892  892    1 :   32   16 :  22541   32829  1204    5    0    2   57 :  30944   2592  10609    655
biovec-BIO_MAX_PAGES    256    260   3072   52   52    4 :   32   16 :    256     256    52    0    0    0   37 :      0    256      0      0
biovec-128           256    260   1536   52   52    2 :   32   16 :    256     256    52    0    0    0   37 :      0    256      0      0
biovec-64            256    260    768   52   52    1 :   32   16 :    260     268    52    0    0    0   37 :      8    259     11      0
biovec-16            256    266    204   14   14    1 :   32   16 :    266     296    14    0    0    0   51 :    293    260    297      0
biovec-4             256    310     60    5    5    1 :   32   16 :    272     304     5    0    0    0   94 :    481    259    484      0
biovec-1             285    288     24    2    2    1 :   32   16 :    512    1277     6    3    0    0  176 :   7486    320   7503     47
bio                  274    318     72    6    6    1 :   32   16 :    509    1391    14    3    0    1   85 :   8265    333   8287     55
sock_inode_cache      36     40    484    5    5    1 :   32   16 :     40      55     5    0    0    0   40 :    545     10    519      0
skbuff_head_cache     16     24    160    1    1    1 :   32   16 :     24      42     1    0    0    0   56 :     13      3      0      0
sock                   2      9    444    1    1    1 :   32   16 :      9      16     1    0    0    0   41 :     14      3     15      0
proc_inode_cache      82     99    440   11   11    1 :   32   16 :     99     179    11    0    0    0   41 :    572     17    521      1
sigqueue               9     27    144    1    1    1 :   32   16 :     16      16     1    0    0    0   59 :    767      1    768      0
radix_tree_node      840    840    272   60   60    1 :   32   16 :    840     884    60    0    0    0   46 :    774     73     11      0
cdev_cache           143    250     28    2    2    1 :   32   16 :    158     173     2    0    0    0  157 :    131     12      0      0
bdev_cache             4     32    120    1    1    1 :   32   16 :     20      32     1    0    0    0   64 :     89      2     87      0
mnt_cache             19     53     72    1    1    1 :   32   16 :     35      54     1    0    0    0   85 :     24      9     14      0
inode_cache         3375   3375    424  375  375    1 :   32   16 :   3375    3418   375    0    0    0   41 :   3340    624    589      0
dentry_cache        6744   6745    204  355  355    1 :   32   16 :   6744    7379   368    2    0    1   51 :   7671   1004   1911     34
filp                 286    300    156   12   12    1 :   32   16 :    291     307    12    0    0    0   57 :    261     25      0      0
names_cache            1      1   4096    1    1    1 :   32   16 :      8      47    23   22    0    0   33 :  15424     47  15471      0
buffer_head         2458   2478     64   42   42    1 :   32   16 :   2458    3060    42    0    0    1   91 :   5662    206   3396     28
mm_struct             35     35    576    5    5    1 :   32   16 :     35      93     5    0    0    0   39 :   1494     11   1479      0
vm_area_struct       500    500     76   10   10    1 :   32   16 :    527    8855    13    1    0    1   82 :  29247    562  28815    513
fs_cache              40     84     44    1    1    1 :   32   16 :     40     128     2    1    0    0  116 :    946      8    929      0
files_cache           27     27    424    3    3    1 :   32   16 :     27      59     4    1    0    0   41 :    944     10    929      0
signal_cache         107    144     52    2    2    1 :   32   16 :    107     205     2    0    0    0  104 :    950     19    877      0
sighand_cache         87     87   1344   29   29    1 :   32   16 :     87     106    30    1    0    0   35 :    920     39    873      0
task_struct           95     95   1632   19   19    2 :   32   16 :    100     117    20    1    0    0   37 :    937     32    874      0
pte_chain           1650   1650     76   33   33    1 :   32   16 :   1650    6684    42    1    0    1   82 :  10041    455   8545    310
pgd                   27     27   4096   27   27    1 :   32   16 :     33      51    48   21    0    0   33 :   1455     50   1479      0
size-131072(DMA)       0      0 131072    0    0   32 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-131072            0      0 131072    0    0   32 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-65536(DMA)        0      0  65536    0    0   16 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-65536             0      0  65536    0    0   16 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-32768(DMA)        0      0  32768    0    0    8 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-32768             0      0  32768    0    0    8 :    8    4 :      1       1     1    1    0    0    9 :      0      1      1      0
size-16384(DMA)        0      0  16384    0    0    4 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-16384             1      1  16384    1    1    4 :    8    4 :      2       4     4    3    0    0    9 :      2      4      5      0
size-8192(DMA)         0      0   8192    0    0    2 :    8    4 :      0       0     0    0    0    0    9 :      0      0      0      0
size-8192             91     91   8192   91   91    2 :    8    4 :     92      96    95    4    0    0    9 :    149     96    154      0
size-4096(DMA)         0      0   4096    0    0    1 :   32   16 :      0       0     0    0    0    0   33 :      0      0      0      0
size-4096             27     27   4096   27   27    1 :   32   16 :     34      57    55   28    0    0   33 :    489     57    519      0
size-2048(DMA)         0      0   2048    0    0    1 :   32   16 :      0       0     0    0    0    0   34 :      0      0      0      0
size-2048             94     94   2048   47   47    1 :   32   16 :     94      96    47    0    0    0   34 :     88     49     43      0
size-1024(DMA)         0      0   1024    0    0    1 :   32   16 :      0       0     0    0    0    0   36 :      0      0      0      0
size-1024             40     40   1024   10   10    1 :   32   16 :     40      67    11    1    0    0   36 :    702     39    703      0
size-512(DMA)          0      0    512    0    0    1 :   32   16 :      8      16     2    2    0    0   40 :     83      2     85      0
size-512             234    248    512   31   31    1 :   32   16 :    248     282    32    1    0    0   40 :   1721     46   1537      0
size-256(DMA)          0      0    268    0    0    1 :   32   16 :     14      28     2    2    0    0   46 :    534      2    536      0
size-256              94     98    268    7    7    1 :   32   16 :     98     149     8    1    0    0   46 :    736     14    657      0
size-192(DMA)          0      0    204    0    0    1 :   32   16 :      0       0     0    0    0    0   51 :      0      0      0      0
size-192             209    209    204   11   11    1 :   32   16 :    209     274    11    0    0    0   51 :    725     75    592      0
size-128(DMA)          0      0    140    0    0    1 :   32   16 :      0       0     0    0    0    0   59 :      0      0      0      0
size-128             681    702    140   26   26    1 :   32   16 :    691     955    26    0    0    0   59 :   2740    297   2351      6
size-64(DMA)           0      0     76    0    0    1 :   32   16 :      0       0     0    0    0    0   82 :      0      0      0      0
size-64              750    750     76   15   15    1 :   32   16 :    750    1041    15    0    0    0   82 :   1141    247    623     27
size-32(DMA)           0      0     44    0    0    1 :   32   16 :      0       0     0    0    0    0  116 :      0      0      0      0
size-32              452    504     44    6    6    1 :   32   16 :    452     619     6    0    0    0  116 :  18166    190  17772    142
kmem_cache           110    110    180    5    5    1 :   32   16 :    110     110     5    0    0    0   54 :     34     72      0      0


* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-22 11:46     ` Douglas Gilbert
@ 2003-03-22 12:05       ` Andrew Morton
  2003-03-24 21:32         ` Badari Pulavarty
  2003-03-25 10:56         ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2003-03-22 12:05 UTC (permalink / raw)
  To: dougg; +Cc: pbadari, linux-kernel, linux-scsi, Jens Axboe

Douglas Gilbert <dougg@torque.net> wrote:
>
> Andrew Morton wrote:
> > Douglas Gilbert <dougg@torque.net> wrote:
> > 
> >>>Slab:           464364 kB
> >>
> > 
> > It's all in slab.
> > 
> > 
> >>I did notice a rather large growth of nodes
> >>in sysfs. For 84 added scsi_debug pseudo disks the number
> >>of sysfs nodes went from 686 to 3347.
> >>
> >>Does anybody know what is the per node memory cost of sysfs?
> > 
> > 
> > > Let's see all of /proc/slabinfo please.
> 
> Andrew,
> Attachments are /proc/slabinfo pre and post:
>   $ modprobe scsi_debug add_host=42 num_devs=2
> which adds 84 pseudo disks.
> 

OK, thanks.  So with 48 disks you've lost five megabytes to blkdev_requests
and deadline_drq objects.  With 4000 disks, you're toast.  That's enough
request structures to put 200 gigabytes of memory under I/O ;)

We need to make the request structures dynamically allocated for other
reasons (which I cannot immediately remember) but it didn't happen.  I guess
we have some motivation now.
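
For what it's worth, the two slabinfo attachments above put rough
numbers on this (a sketch only, using the active-object deltas between
slabinfo_pre.txt and slabinfo_post.txt for the 84 added pseudo disks,
and extrapolating the same per-disk cost to 4000 disks):

#include <stdio.h>

int main(void)
{
	/* active objects before/after the 84 scsi_debug disks, from the attachments */
	const unsigned long rq_diff  = 22273 - 768;   /* blkdev_requests, 156 bytes each */
	const unsigned long drq_diff = 22274 - 768;   /* deadline_drq, 64 bytes each */
	const unsigned long bytes = rq_diff * 156 + drq_diff * 64;

	printf("84 disks:   ~%lu KB\n", bytes >> 10);               /* ~4620 KB */
	printf("4000 disks: ~%lu MB\n", (bytes / 84 * 4000) >> 20); /* ~214 MB */
	return 0;
}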



* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-22 12:05       ` Andrew Morton
@ 2003-03-24 21:32         ` Badari Pulavarty
  2003-03-24 22:22           ` Douglas Gilbert
  2003-03-25  0:10           ` Andrew Morton
  2003-03-25 10:56         ` Jens Axboe
  1 sibling, 2 replies; 22+ messages in thread
From: Badari Pulavarty @ 2003-03-24 21:32 UTC (permalink / raw)
  To: Andrew Morton, dougg; +Cc: linux-kernel, linux-scsi, Jens Axboe

On Saturday 22 March 2003 04:05 am, Andrew Morton wrote:
> OK, thanks.  So with 48 disks you've lost five megabytes to blkdev_requests
> and deadline_drq objects.  With 4000 disks, you're toast.  That's enough
> request structures to put 200 gigabytes of memory under I/O ;)
>
> We need to make the request structures dynamically allocated for other
> reasons (which I cannot immediately remember) but it didn't happen.  I
> guess we have some motivation now.

Here is the list of slab caches which consumed more than 1 MB
in the process of inserting 4000 disks.

#insmod scsi_debug.ko add_host=4 num_devs=1000

deadline_drq    before:1280 after:1025420 diff:1024140 size:64 incr:65544960
blkdev_requests  before:1280 after:1025400 diff:1024120 size:156 incr:159762720

* deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix this.

inode_cache     before:700 after:140770 diff:140070 size:364 incr:50985480
dentry_cache    before:4977 after:145061 diff:140084 size:172 incr:24094448

* inode cache increased by 50 MB, dentry cache 24 MB. 
It looks like we cached 140,000 inodes. I wonder why ? 

size-8192       before:8 after:4010 diff:4002 size:8192 incr:32784384

* 32 MB is for the scsi_debug ram disks of the 4 hosts (4*8MB)

size-2048       before:112 after:4102 diff:3990 size:2060 incr:8219400
size-512        before:87 after:8085 diff:7998 size:524 incr:4190952
size-192        before:907 after:16910 diff:16003 size:204 incr:3264612
size-64         before:459 after:76500 diff:76041 size:76 incr:5779116
size-32         before:523 after:24528 diff:24005 size:44 incr:1056220

* 30 MB for all other structures. I need to look more closely at what these are...

Thanks,
Badari



* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-24 21:32         ` Badari Pulavarty
@ 2003-03-24 22:22           ` Douglas Gilbert
  2003-03-24 22:54             ` Badari Pulavarty
  2003-03-25  0:10           ` Andrew Morton
  1 sibling, 1 reply; 22+ messages in thread
From: Douglas Gilbert @ 2003-03-24 22:22 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Andrew Morton, linux-kernel, linux-scsi, Jens Axboe

Badari Pulavarty wrote:
> On Saturday 22 March 2003 04:05 am, Andrew Morton wrote:
> 
>>OK, thanks.  So with 48 disks you've lost five megabytes to blkdev_requests
>>and deadline_drq objects.  With 4000 disks, you're toast.  That's enough
>>request structures to put 200 gigabytes of memory under I/O ;)
>>
>>We need to make the request structures dynamically allocated for other
>>reasons (which I cannot immediately remember) but it didn't happen.  I
>>guess we have some motivation now.
> 
> 
> Here is the list of slab caches which consumed more than 1 MB
> in the process of inserting 4000 disks.
> 
> #insmod scsi_debug.ko add_host=4 num_devs=1000
> 
> deadline_drq    before:1280 after:1025420 diff:1024140 size:64 incr:65544960
> blkdev_requests  before:1280 after:1025400 diff:1024120 size:156 incr:159762720
> 
> * deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix this.
> 
> inode_cache     before:700 after:140770 diff:140070 size:364 incr:50985480
> dentry_cache    before:4977 after:145061 diff:140084 size:172 incr:24094448
> 
> * inode cache increased by 50 MB, dentry cache 24 MB. 
> It looks like we cached 140,000 inodes. I wonder why ? 
 > <snip/>

Badari,
What number do you get from
  # cd /sys; du -a | wc
when you have 4000 disks?

With two disks the count on my system is 528.

Also scsi_debug should use only 8 MB (default) for
simulated storage shared between all pseudo disks
(i.e. not 8 MB per simulated host).

Doug Gilbert






* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-24 22:22           ` Douglas Gilbert
@ 2003-03-24 22:54             ` Badari Pulavarty
  0 siblings, 0 replies; 22+ messages in thread
From: Badari Pulavarty @ 2003-03-24 22:54 UTC (permalink / raw)
  To: dougg; +Cc: Andrew Morton, linux-kernel, linux-scsi, Jens Axboe

On Monday 24 March 2003 02:22 pm, Douglas Gilbert wrote:
> Badari Pulavarty wrote:
> > On Saturday 22 March 2003 04:05 am, Andrew Morton wrote:
> >>OK, thanks.  So with 48 disks you've lost five megabytes to
> >> blkdev_requests and deadline_drq objects.  With 4000 disks, you're
> >> toast.  That's enough request structures to put 200 gigabytes of memory
> >> under I/O ;)
> >>
> >>We need to make the request structures dynamically allocated for other
> >>reasons (which I cannot immediately remember) but it didn't happen.  I
> >>guess we have some motivation now.
> >
> > Here is the list of slab caches which consumed more than 1 MB
> > in the process of inserting 4000 disks.
> >
> > #insmod scsi_debug.ko add_host=4 num_devs=1000
> >
> > deadline_drq    before:1280 after:1025420 diff:1024140 size:64 incr:65544960
> > blkdev_requests  before:1280 after:1025400 diff:1024120 size:156 incr:159762720
> >
> > * deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix
> > this.
> >
> > inode_cache     before:700 after:140770 diff:140070 size:364 incr:50985480
> > dentry_cache    before:4977 after:145061 diff:140084 size:172 incr:24094448
> >
> > * inode cache increased by 50 MB, dentry cache 24 MB.
> > It looks like we cached 140,000 inodes. I wonder why ?
> >
>  > <snip/>
>
> Badari,
> What number do you get from
>   # cd /sys; du -a | wc
> when you have 4000 disks?

[root@elm3b78 sysfs]# du -a | wc -l
 140735

Okay. That explains the inodes.
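
Putting the numbers in this thread together gives a rough answer to
your earlier question about the per-node memory cost of sysfs (a
sketch only, using the inode_cache/dentry_cache deltas and object
sizes posted earlier plus the du count above):

#include <stdio.h>

int main(void)
{
	/* deltas and object sizes for the 4000-disk run, from the earlier mail */
	const unsigned long inodes   = 140070, inode_sz  = 364;  /* inode_cache */
	const unsigned long dentries = 140084, dentry_sz = 172;  /* dentry_cache */
	const unsigned long nodes    = 140735;                    /* du -a | wc -l above */
	const unsigned long bytes = inodes * inode_sz + dentries * dentry_sz;

	printf("~%lu MB for ~%lu sysfs nodes\n", bytes >> 20, nodes);  /* ~71 MB */
	printf("~%lu bytes/node, ~%lu nodes/disk\n",
	       bytes / nodes, nodes / 4000);                           /* ~533, ~35 */
	return 0;
}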

>
> Also scsi_debug should use only 8 MB (default) for
> simulated storage shared between all pseudo disks
> (i.e. not 8 MB per simulated host).

Hmm. When I did
insmod scsi_debug.o add_host=1 num_devs=4000
it used 8MB. 

But when I did
insmod scsi_debug.o add_host=4 num_devs=1000
it used 32 MB. So I assumed it is per host.

Before:
size-8192              8      8   8192    8    8    2 :    8    4 :      9     12    10    2    0    0   37 :      3     12      7      0
After:
size-8192           4010   4010   8192 4010 4010    2 :    8    4 :   4010   4014  4012    2    0    0   37 :   4006   4014   4011      0

Thanks,
Badari





* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-25  0:10           ` Andrew Morton
@ 2003-03-24 22:57             ` Badari Pulavarty
  0 siblings, 0 replies; 22+ messages in thread
From: Badari Pulavarty @ 2003-03-24 22:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dougg, linux-kernel, linux-scsi, axboe

On Monday 24 March 2003 04:10 pm, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> > Here is the list of slab caches which consumed more than 1 MB
> > in the process of inserting 4000 disks.
>
> Thanks for doing this.
>
> > #insmod scsi_debug.ko add_host=4 num_devs=1000
> >
> > deadline_drq    before:1280 after:1025420 diff:1024140 size:64 incr:65544960
> > blkdev_requests  before:1280 after:1025400 diff:1024120 size:156 incr:159762720
> >
> > * deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix
> > this.
>
> Yes, we do.  But.
>
> > inode_cache     before:700 after:140770 diff:140070 size:364 incr:50985480
> > dentry_cache    before:4977 after:145061 diff:140084 size:172 incr:24094448
> >
> > * inode cache increased by 50 MB, dentry cache 24 MB.
> > It looks like we cached 140,000 inodes. I wonder why ?
>
> # find /sys/block/hda | wc -l
>      43
>
> Oh shit, we're toast.
>
> How many partitions did these "disks" have?

These are all scsi_debug disks. No partitions.
Yeah!! I know what you mean - with partitions we
are going to get 5 more inodes per partition...

[root@elm3b78 sdaaa]# find /sysfs/block/sdaa
/sysfs/block/sdaa
/sysfs/block/sdaa/iosched
/sysfs/block/sdaa/iosched/fifo_batch
/sysfs/block/sdaa/iosched/front_merges
/sysfs/block/sdaa/iosched/writes_starved
/sysfs/block/sdaa/iosched/write_expire
/sysfs/block/sdaa/iosched/read_expire
/sysfs/block/sdaa/device
/sysfs/block/sdaa/stat
/sysfs/block/sdaa/size
/sysfs/block/sdaa/range
/sysfs/block/sdaa/dev

Thanks,
Badari



* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-24 21:32         ` Badari Pulavarty
  2003-03-24 22:22           ` Douglas Gilbert
@ 2003-03-25  0:10           ` Andrew Morton
  2003-03-24 22:57             ` Badari Pulavarty
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2003-03-25  0:10 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: dougg, linux-kernel, linux-scsi, axboe

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Here is the list of slab caches which consumed more than 1 MB
> in the process of inserting 4000 disks.

Thanks for doing this.

> #insmod scsi_debug.ko add_host=4 num_devs=1000
> 
> deadline_drq    before:1280 after:1025420 diff:1024140 size:64 incr:65544960
> blkdev_requests  before:1280 after:1025400 diff:1024120 size:156 incr:159762720
> 
> * deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix this.

Yes, we do.  But.

> inode_cache     before:700 after:140770 diff:140070 size:364 incr:50985480
> dentry_cache    before:4977 after:145061 diff:140084 size:172 incr:24094448
> 
> * inode cache increased by 50 MB, dentry cache 24 MB. 
> It looks like we cached 140,000 inodes. I wonder why ? 

# find /sys/block/hda | wc -l
     43

Oh shit, we're toast.

How many partitions did these "disks" have?

> size-2048       before:112 after:4102 diff:3990 size:2060 incr:8219400
> size-512        before:87 after:8085 diff:7998 size:524 incr:4190952
> size-192        before:907 after:16910 diff:16003 size:204 incr:3264612
> size-64         before:459 after:76500 diff:76041 size:76 incr:5779116
> size-32         before:523 after:24528 diff:24005 size:44 incr:1056220
> 
> * 30 MB for all other structures. I need to look more closely at what these are...

Yes, that will take some work.  What I would do is to change kmalloc:

+ int foo;

  kmalloc(...)
  {
+	if (foo)
+		dump_stack();


and then set `foo' in gdb across the registration of one disk.

Probably you can set foo by hand in scsi_register_device() or wherever it
happens.



* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-22 12:05       ` Andrew Morton
  2003-03-24 21:32         ` Badari Pulavarty
@ 2003-03-25 10:56         ` Jens Axboe
  2003-03-25 11:23           ` Jens Axboe
  2003-03-25 11:39           ` Nick Piggin
  1 sibling, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2003-03-25 10:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dougg, pbadari, linux-kernel, linux-scsi

On Sat, Mar 22 2003, Andrew Morton wrote:
> Douglas Gilbert <dougg@torque.net> wrote:
> >
> > Andrew Morton wrote:
> > > Douglas Gilbert <dougg@torque.net> wrote:
> > > 
> > >>>Slab:           464364 kB
> > >>
> > > 
> > > It's all in slab.
> > > 
> > > 
> > >>I did notice a rather large growth of nodes
> > >>in sysfs. For 84 added scsi_debug pseudo disks the number
> > >>of sysfs nodes went from 686 to 3347.
> > >>
> > >>Does anybody know what is the per node memory cost of sysfs?
> > > 
> > > 
> > > Let's see all of /proc/slabinfo please.
> > 
> > Andrew,
> > Attachments are /proc/slabinfo pre and post:
> >   $ modprobe scsi_debug add_host=42 num_devs=2
> > which adds 84 pseudo disks.
> > 
> 
> OK, thanks.  So with 48 disks you've lost five megabytes to blkdev_requests
> and deadline_drq objects.  With 4000 disks, you're toast.  That's enough
> request structures to put 200 gigabytes of memory under I/O ;)
> 
> We need to make the request structures dynamically allocated for other
> reasons (which I cannot immediately remember) but it didn't happen.  I guess
> we have some motivation now.

Here's a patch that makes the request allocation (and io scheduler
private data) dynamic, with lower and upper bounds of 4 and 256
respectively. The numbers are a bit random - the 4 will allow us to make
progress, but it might be a smidgen too low. Perhaps 8 would be good.
256 is twice as much as before, but that should be alright as long as
the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
two variables.

We lose the old batching functionality for now. I can resurrect that
if needed. It's a rough fit with the mempool; it doesn't _quite_ fit our
needs here. I'll probably end up doing a specialised block pool scheme
for this.

Hasn't been tested all that much, it boots though :-)

 drivers/block/deadline-iosched.c |   86 ++++++++---------
 drivers/block/elevator.c         |   18 +++
 drivers/block/ll_rw_blk.c        |  197 +++++++++++++--------------------------
 include/linux/blkdev.h           |   11 +-
 include/linux/elevator.h         |    7 +
 5 files changed, 141 insertions(+), 178 deletions(-)

===== drivers/block/deadline-iosched.c 1.17 vs edited =====
--- 1.17/drivers/block/deadline-iosched.c	Fri Mar 14 16:55:04 2003
+++ edited/drivers/block/deadline-iosched.c	Tue Mar 25 11:05:37 2003
@@ -71,6 +71,8 @@
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
+
+	mempool_t *drq_pool;
 };
 
 /*
@@ -673,28 +675,11 @@
 static void deadline_exit(request_queue_t *q, elevator_t *e)
 {
 	struct deadline_data *dd = e->elevator_data;
-	struct deadline_rq *drq;
-	struct request *rq;
-	int i;
 
 	BUG_ON(!list_empty(&dd->fifo_list[READ]));
 	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
 
-	for (i = READ; i <= WRITE; i++) {
-		struct request_list *rl = &q->rq[i];
-		struct list_head *entry;
-
-		list_for_each(entry, &rl->free) {
-			rq = list_entry_rq(entry);
-
-			if ((drq = RQ_DATA(rq)) == NULL)
-				continue;
-
-			rq->elevator_private = NULL;
-			kmem_cache_free(drq_pool, drq);
-		}
-	}
-
+	mempool_destroy(dd->drq_pool);
 	kfree(dd->hash);
 	kfree(dd);
 }
@@ -706,9 +691,7 @@
 static int deadline_init(request_queue_t *q, elevator_t *e)
 {
 	struct deadline_data *dd;
-	struct deadline_rq *drq;
-	struct request *rq;
-	int i, ret = 0;
+	int i;
 
 	if (!drq_pool)
 		return -ENOMEM;
@@ -724,6 +707,13 @@
 		return -ENOMEM;
 	}
 
+	dd->drq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, drq_pool);
+	if (!dd->drq_pool) {
+		kfree(dd->hash);
+		kfree(dd);
+		return -ENOMEM;
+	}
+
 	for (i = 0; i < DL_HASH_ENTRIES; i++)
 		INIT_LIST_HEAD(&dd->hash[i]);
 
@@ -739,33 +729,41 @@
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;
 	e->elevator_data = dd;
+	return 0;
+}
 
-	for (i = READ; i <= WRITE; i++) {
-		struct request_list *rl = &q->rq[i];
-		struct list_head *entry;
-
-		list_for_each(entry, &rl->free) {
-			rq = list_entry_rq(entry);
-
-			drq = kmem_cache_alloc(drq_pool, GFP_KERNEL);
-			if (!drq) {
-				ret = -ENOMEM;
-				break;
-			}
+static void deadline_put_request(request_queue_t *q, struct request *rq)
+{
+	struct deadline_data *dd = q->elevator.elevator_data;
+	struct deadline_rq *drq = RQ_DATA(rq);
 
-			memset(drq, 0, sizeof(*drq));
-			INIT_LIST_HEAD(&drq->fifo);
-			INIT_LIST_HEAD(&drq->hash);
-			RB_CLEAR(&drq->rb_node);
-			drq->request = rq;
-			rq->elevator_private = drq;
-		}
+	if (drq) {
+		mempool_free(drq, dd->drq_pool);
+		rq->elevator_private = NULL;
 	}
+}
+
+static int
+deadline_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+	struct deadline_data *dd = q->elevator.elevator_data;
+	struct deadline_rq *drq;
 
-	if (ret)
-		deadline_exit(q, e);
+	drq = mempool_alloc(dd->drq_pool, gfp_mask);
+	if (drq) {
+		RB_CLEAR(&drq->rb_node);
+		drq->request = rq;
 
-	return ret;
+		INIT_LIST_HEAD(&drq->hash);
+		drq->hash_valid_count = 0;
+
+		INIT_LIST_HEAD(&drq->fifo);
+
+		rq->elevator_private = drq;
+		return 0;
+	}
+
+	return 1;
 }
 
 /*
@@ -916,6 +914,8 @@
 	.elevator_queue_empty_fn =	deadline_queue_empty,
 	.elevator_former_req_fn =	deadline_former_request,
 	.elevator_latter_req_fn =	deadline_latter_request,
+	.elevator_set_req_fn =		deadline_set_request,
+	.elevator_put_req_fn = 		deadline_put_request,
 	.elevator_init_fn =		deadline_init,
 	.elevator_exit_fn =		deadline_exit,
 
===== drivers/block/elevator.c 1.40 vs edited =====
--- 1.40/drivers/block/elevator.c	Sun Feb 16 12:32:35 2003
+++ edited/drivers/block/elevator.c	Tue Mar 25 11:04:01 2003
@@ -408,6 +408,24 @@
 	return NULL;
 }
 
+int elv_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+	elevator_t *e = &q->elevator;
+
+	if (e->elevator_set_req_fn)
+		return e->elevator_set_req_fn(q, rq, gfp_mask);
+
+	return 0;
+}
+
+void elv_put_request(request_queue_t *q, struct request *rq)
+{
+	elevator_t *e = &q->elevator;
+
+	if (e->elevator_put_req_fn)
+		e->elevator_put_req_fn(q, rq);
+}
+
 int elv_register_queue(struct gendisk *disk)
 {
 	request_queue_t *q = disk->queue;
===== drivers/block/ll_rw_blk.c 1.161 vs edited =====
--- 1.161/drivers/block/ll_rw_blk.c	Mon Mar 24 02:56:03 2003
+++ edited/drivers/block/ll_rw_blk.c	Tue Mar 25 11:53:37 2003
@@ -48,12 +48,6 @@
  */
 static int queue_nr_requests;
 
-/*
- * How many free requests must be available before we wake a process which
- * is waiting for a request?
- */
-static int batch_requests;
-
 unsigned long blk_max_low_pfn, blk_max_pfn;
 int blk_nohighio = 0;
 
@@ -1118,26 +1112,6 @@
 	spin_unlock_irq(&blk_plug_lock);
 }
 
-static int __blk_cleanup_queue(struct request_list *list)
-{
-	struct list_head *head = &list->free;
-	struct request *rq;
-	int i = 0;
-
-	while (!list_empty(head)) {
-		rq = list_entry(head->next, struct request, queuelist);
-		list_del_init(&rq->queuelist);
-		kmem_cache_free(request_cachep, rq);
-		i++;
-	}
-
-	if (i != list->count)
-		printk("request list leak!\n");
-
-	list->count = 0;
-	return i;
-}
-
 /**
  * blk_cleanup_queue: - release a &request_queue_t when it is no longer needed
  * @q:    the request queue to be released
@@ -1154,18 +1128,14 @@
  **/
 void blk_cleanup_queue(request_queue_t * q)
 {
-	int count = (queue_nr_requests*2);
+	struct request_list *rl = &q->rq;
 
 	elevator_exit(q);
 
-	count -= __blk_cleanup_queue(&q->rq[READ]);
-	count -= __blk_cleanup_queue(&q->rq[WRITE]);
-
 	del_timer_sync(&q->unplug_timer);
 	flush_scheduled_work();
 
-	if (count)
-		printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+	mempool_destroy(rl->rq_pool);
 
 	if (blk_queue_tagged(q))
 		blk_queue_free_tags(q);
@@ -1175,42 +1145,17 @@
 
 static int blk_init_free_list(request_queue_t *q)
 {
-	struct request_list *rl;
-	struct request *rq;
-	int i;
+	struct request_list *rl = &q->rq;
 
-	INIT_LIST_HEAD(&q->rq[READ].free);
-	INIT_LIST_HEAD(&q->rq[WRITE].free);
-	q->rq[READ].count = 0;
-	q->rq[WRITE].count = 0;
+	rl->count[READ] = BLKDEV_MAX_RQ;
+	rl->count[WRITE] = BLKDEV_MAX_RQ;
 
-	/*
-	 * Divide requests in half between read and write
-	 */
-	rl = &q->rq[READ];
-	for (i = 0; i < (queue_nr_requests*2); i++) {
-		rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
-		if (!rq)
-			goto nomem;
-
-		/*
-		 * half way through, switch to WRITE list
-		 */
-		if (i == queue_nr_requests)
-			rl = &q->rq[WRITE];
+	rl->rq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, request_cachep);
 
-		memset(rq, 0, sizeof(struct request));
-		rq->rq_status = RQ_INACTIVE;
-		list_add(&rq->queuelist, &rl->free);
-		rl->count++;
-	}
+	if (!rl->rq_pool)
+		return -ENOMEM;
 
-	init_waitqueue_head(&q->rq[READ].wait);
-	init_waitqueue_head(&q->rq[WRITE].wait);
 	return 0;
-nomem:
-	blk_cleanup_queue(q);
-	return 1;
 }
 
 static int __make_request(request_queue_t *, struct bio *);
@@ -1277,34 +1222,57 @@
 	return 0;
 }
 
+static inline void blk_free_request(request_queue_t *q, struct request *rq)
+{
+	elv_put_request(q, rq);
+	mempool_free(rq, q->rq.rq_pool);
+}
+
+static inline struct request *blk_alloc_request(request_queue_t *q,int gfp_mask)
+{
+	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+
+	if (rq) {
+		if (!elv_set_request(q, rq, gfp_mask))
+			return rq;
+
+		mempool_free(rq, q->rq.rq_pool);
+	}
+
+	return NULL;
+}
+
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
 /*
  * Get a free request. queue lock must be held and interrupts
  * disabled on the way in.
  */
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = q->rq + rw;
+	struct request_list *rl = &q->rq;
 
-	if (!list_empty(&rl->free)) {
-		rq = blkdev_free_rq(&rl->free);
-		list_del_init(&rq->queuelist);
-		rq->ref_count = 1;
-		rl->count--;
-		if (rl->count < queue_congestion_on_threshold())
-			set_queue_congested(q, rw);
+	if (!rl->count[rw])
+		return NULL;
+
+	rq = blk_alloc_request(q, gfp_mask);
+	if (rq) {
+		INIT_LIST_HEAD(&rq->queuelist);
 		rq->flags = 0;
-		rq->rq_status = RQ_ACTIVE;
 		rq->errors = 0;
-		rq->special = NULL;
-		rq->buffer = NULL;
-		rq->data = NULL;
-		rq->sense = NULL;
-		rq->waiting = NULL;
+		rq->rq_status = RQ_ACTIVE;
+		rl->count[rw]--;
+		if (rl->count[rw] < queue_congestion_on_threshold())
+			set_queue_congested(q, rw);
 		rq->bio = rq->biotail = NULL;
+		rq->buffer = NULL;
+		rq->ref_count = 1;
 		rq->q = q;
 		rq->rl = rl;
+		rq->special = NULL;
+		rq->data = NULL;
+		rq->waiting = NULL;
+		rq->sense = NULL;
 	}
 
 	return rq;
@@ -1316,7 +1284,7 @@
 static struct request *get_request_wait(request_queue_t *q, int rw)
 {
 	DEFINE_WAIT(wait);
-	struct request_list *rl = &q->rq[rw];
+	struct request_list *rl = &q->rq;
 	struct request *rq;
 
 	spin_lock_prefetch(q->queue_lock);
@@ -1325,20 +1293,15 @@
 	do {
 		int block = 0;
 
-		prepare_to_wait_exclusive(&rl->wait, &wait,
-					TASK_UNINTERRUPTIBLE);
 		spin_lock_irq(q->queue_lock);
-		if (!rl->count)
+		if (!rl->count[rw])
 			block = 1;
 		spin_unlock_irq(q->queue_lock);
 
 		if (block)
 			io_schedule();
-		finish_wait(&rl->wait, &wait);
 
-		spin_lock_irq(q->queue_lock);
-		rq = get_request(q, rw);
-		spin_unlock_irq(q->queue_lock);
+		rq = get_request(q, rw, GFP_NOIO);
 	} while (rq == NULL);
 	return rq;
 }
@@ -1349,13 +1312,7 @@
 
 	BUG_ON(rw != READ && rw != WRITE);
 
-	spin_lock_irq(q->queue_lock);
-	rq = get_request(q, rw);
-	spin_unlock_irq(q->queue_lock);
-
-	if (!rq && (gfp_mask & __GFP_WAIT))
-		rq = get_request_wait(q, rw);
-
+	rq = get_request(q, rw, gfp_mask);
 	if (rq) {
 		rq->flags = 0;
 		rq->buffer = NULL;
@@ -1374,7 +1331,7 @@
 
 	BUG_ON(rw != READ && rw != WRITE);
 
-	rq = get_request(q, rw);
+	rq = get_request(q, rw, GFP_ATOMIC);
 
 	if (rq) {
 		rq->flags = 0;
@@ -1508,34 +1465,26 @@
 	if (unlikely(!q))
 		return;
 
-	req->rq_status = RQ_INACTIVE;
-	req->q = NULL;
-	req->rl = NULL;
-
 	/*
 	 * Request may not have originated from ll_rw_blk. if not,
 	 * it didn't come out of our reserved rq pools
 	 */
 	if (rl) {
-		int rw = 0;
+		int rw = rq_data_dir(req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 
-		list_add(&req->queuelist, &rl->free);
+		blk_free_request(q, req);
 
-		if (rl == &q->rq[WRITE])
-			rw = WRITE;
-		else if (rl == &q->rq[READ])
-			rw = READ;
-		else
-			BUG();
-
-		rl->count++;
-		if (rl->count >= queue_congestion_off_threshold())
+		rl->count[rw]++;
+		if (rl->count[rw] >= queue_congestion_off_threshold())
 			clear_queue_congested(q, rw);
-		if (rl->count >= batch_requests && waitqueue_active(&rl->wait))
-			wake_up(&rl->wait);
 	}
+
+	req->rq_status = RQ_INACTIVE;
+	req->q = NULL;
+	req->rl = NULL;
+	req->elevator_private = NULL;
 }
 
 void blk_put_request(struct request *req)
@@ -1772,7 +1721,7 @@
 	if (freereq) {
 		req = freereq;
 		freereq = NULL;
-	} else if ((req = get_request(q, rw)) == NULL) {
+	} else if ((req = get_request(q, rw, GFP_ATOMIC)) == NULL) {
 		spin_unlock_irq(q->queue_lock);
 
 		/*
@@ -1815,8 +1764,8 @@
 		__blk_put_request(q, freereq);
 
 	if (blk_queue_plugged(q)) {
-		int nr_queued = (queue_nr_requests - q->rq[0].count) +
-				(queue_nr_requests - q->rq[1].count);
+		int nr_queued = (queue_nr_requests - q->rq.count[0]) +
+				(queue_nr_requests - q->rq.count[1]);
 		if (nr_queued == q->unplug_thresh)
 			__generic_unplug_device(q);
 	}
@@ -2183,7 +2132,6 @@
 
 int __init blk_dev_init(void)
 {
-	int total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
 	int i;
 
 	request_cachep = kmem_cache_create("blkdev_requests",
@@ -2191,24 +2139,11 @@
 	if (!request_cachep)
 		panic("Can't create request pool slab cache\n");
 
-	/*
-	 * Free request slots per queue.  One per quarter-megabyte.
-	 * We use this many requests for reads, and this many for writes.
-	 */
-	queue_nr_requests = (total_ram >> 9) & ~7;
-	if (queue_nr_requests < 16)
-		queue_nr_requests = 16;
-	if (queue_nr_requests > 128)
-		queue_nr_requests = 128;
-
-	batch_requests = queue_nr_requests / 8;
-	if (batch_requests > 8)
-		batch_requests = 8;
+	queue_nr_requests = BLKDEV_MAX_RQ;
 
 	printk("block request queues:\n");
-	printk(" %d requests per read queue\n", queue_nr_requests);
-	printk(" %d requests per write queue\n", queue_nr_requests);
-	printk(" %d requests per batch\n", batch_requests);
+	printk(" %d/%d requests per read queue\n", BLKDEV_MIN_RQ, queue_nr_requests);
+	printk(" %d/%d requests per write queue\n", BLKDEV_MIN_RQ, queue_nr_requests);
 	printk(" enter congestion at %d\n", queue_congestion_on_threshold());
 	printk(" exit congestion at %d\n", queue_congestion_off_threshold());
 
===== include/linux/blkdev.h 1.98 vs edited =====
--- 1.98/include/linux/blkdev.h	Tue Feb 18 11:29:00 2003
+++ edited/include/linux/blkdev.h	Tue Mar 25 10:24:29 2003
@@ -10,6 +10,7 @@
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
 #include <linux/wait.h>
+#include <linux/mempool.h>
 
 #include <asm/scatterlist.h>
 
@@ -18,10 +19,12 @@
 struct elevator_s;
 typedef struct elevator_s elevator_t;
 
+#define BLKDEV_MIN_RQ	4
+#define BLKDEV_MAX_RQ	256
+
 struct request_list {
-	unsigned int count;
-	struct list_head free;
-	wait_queue_head_t wait;
+	unsigned int count[2];
+	mempool_t *rq_pool;
 };
 
 /*
@@ -180,7 +183,7 @@
 	/*
 	 * the queue request freelist, one for reads and one for writes
 	 */
-	struct request_list	rq[2];
+	struct request_list	rq;
 
 	request_fn_proc		*request_fn;
 	merge_request_fn	*back_merge_fn;
===== include/linux/elevator.h 1.18 vs edited =====
--- 1.18/include/linux/elevator.h	Sun Jan 12 15:10:40 2003
+++ edited/include/linux/elevator.h	Tue Mar 25 11:03:42 2003
@@ -15,6 +15,8 @@
 typedef void (elevator_remove_req_fn) (request_queue_t *, struct request *);
 typedef struct request *(elevator_request_list_fn) (request_queue_t *, struct request *);
 typedef struct list_head *(elevator_get_sort_head_fn) (request_queue_t *, struct request *);
+typedef int (elevator_set_req_fn) (request_queue_t *, struct request *, int);
+typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
 
 typedef int (elevator_init_fn) (request_queue_t *, elevator_t *);
 typedef void (elevator_exit_fn) (request_queue_t *, elevator_t *);
@@ -34,6 +36,9 @@
 	elevator_request_list_fn *elevator_former_req_fn;
 	elevator_request_list_fn *elevator_latter_req_fn;
 
+	elevator_set_req_fn *elevator_set_req_fn;
+	elevator_put_req_fn *elevator_put_req_fn;
+
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 
@@ -58,6 +63,8 @@
 extern struct request *elv_latter_request(request_queue_t *, struct request *);
 extern int elv_register_queue(struct gendisk *);
 extern void elv_unregister_queue(struct gendisk *);
+extern int elv_set_request(request_queue_t *, struct request *, int);
+extern void elv_put_request(request_queue_t *, struct request *);
 
 #define __elv_add_request_pos(q, rq, pos)	\
 	(q)->elevator.elevator_add_req_fn((q), (rq), (pos))

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-25 10:56         ` Jens Axboe
@ 2003-03-25 11:23           ` Jens Axboe
  2003-03-25 11:37             ` Jens Axboe
  2003-03-25 11:39           ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2003-03-25 11:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dougg, pbadari, linux-kernel, linux-scsi

On Tue, Mar 25 2003, Jens Axboe wrote:
> Here's a patch that makes the request allocation (and io scheduler
> private data) dynamic, with upper and lower bounds of 4 and 256
> respectively. The numbers are a bit random - the 4 will allow us to make
> progress, but it might be a smidgen too low. Perhaps 8 would be good.
> 256 is twice as much as before, but that should be alright as long as
> the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
> two variables.
> 
> We loose the old batching functionality, for now. I can resurrect that
> if needed. It's a rough fit with the mempool, it doesn't _quite_ fit our
> needs here. I'll probably end up doing a specialised block pool scheme
> for this.
> 
> Hasn't been tested all that much, it boots though :-)

Here's a version with better lock handling. We drop the queue_lock for
most parts of __make_request(), except the actual io scheduler calls.

Patch is against 2.5.66-BK as of today.
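
In rough outline, the locking in the new __make_request() comes down to
this shape (a condensed sketch for anyone skimming the diff, not the
actual code):

	spin_lock_irq(q->queue_lock);
	/* elevator merge attempts stay under the lock, as before */
	spin_unlock_irq(q->queue_lock);

	/* request allocation happens unlocked; get_request() only takes
	 * the lock briefly to update rl->count[rw] and congestion state */
	req = get_request(q, rw, GFP_NOIO);

	spin_lock_irq(q->queue_lock);
	add_request(q, req, insert_here);	/* io scheduler call */
	spin_unlock_irq(q->queue_lock);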

 drivers/block/deadline-iosched.c |   86 ++++++-------
 drivers/block/elevator.c         |   18 ++
 drivers/block/ll_rw_blk.c        |  245 ++++++++++++---------------------------
 include/linux/blkdev.h           |   12 +
 include/linux/elevator.h         |    7 +
 5 files changed, 154 insertions(+), 214 deletions(-)

===== drivers/block/deadline-iosched.c 1.17 vs edited =====
--- 1.17/drivers/block/deadline-iosched.c	Fri Mar 14 16:55:04 2003
+++ edited/drivers/block/deadline-iosched.c	Tue Mar 25 11:05:37 2003
@@ -71,6 +71,8 @@
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
+
+	mempool_t *drq_pool;
 };
 
 /*
@@ -673,28 +675,11 @@
 static void deadline_exit(request_queue_t *q, elevator_t *e)
 {
 	struct deadline_data *dd = e->elevator_data;
-	struct deadline_rq *drq;
-	struct request *rq;
-	int i;
 
 	BUG_ON(!list_empty(&dd->fifo_list[READ]));
 	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
 
-	for (i = READ; i <= WRITE; i++) {
-		struct request_list *rl = &q->rq[i];
-		struct list_head *entry;
-
-		list_for_each(entry, &rl->free) {
-			rq = list_entry_rq(entry);
-
-			if ((drq = RQ_DATA(rq)) == NULL)
-				continue;
-
-			rq->elevator_private = NULL;
-			kmem_cache_free(drq_pool, drq);
-		}
-	}
-
+	mempool_destroy(dd->drq_pool);
 	kfree(dd->hash);
 	kfree(dd);
 }
@@ -706,9 +691,7 @@
 static int deadline_init(request_queue_t *q, elevator_t *e)
 {
 	struct deadline_data *dd;
-	struct deadline_rq *drq;
-	struct request *rq;
-	int i, ret = 0;
+	int i;
 
 	if (!drq_pool)
 		return -ENOMEM;
@@ -724,6 +707,13 @@
 		return -ENOMEM;
 	}
 
+	dd->drq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, drq_pool);
+	if (!dd->drq_pool) {
+		kfree(dd->hash);
+		kfree(dd);
+		return -ENOMEM;
+	}
+
 	for (i = 0; i < DL_HASH_ENTRIES; i++)
 		INIT_LIST_HEAD(&dd->hash[i]);
 
@@ -739,33 +729,41 @@
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;
 	e->elevator_data = dd;
+	return 0;
+}
 
-	for (i = READ; i <= WRITE; i++) {
-		struct request_list *rl = &q->rq[i];
-		struct list_head *entry;
-
-		list_for_each(entry, &rl->free) {
-			rq = list_entry_rq(entry);
-
-			drq = kmem_cache_alloc(drq_pool, GFP_KERNEL);
-			if (!drq) {
-				ret = -ENOMEM;
-				break;
-			}
+static void deadline_put_request(request_queue_t *q, struct request *rq)
+{
+	struct deadline_data *dd = q->elevator.elevator_data;
+	struct deadline_rq *drq = RQ_DATA(rq);
 
-			memset(drq, 0, sizeof(*drq));
-			INIT_LIST_HEAD(&drq->fifo);
-			INIT_LIST_HEAD(&drq->hash);
-			RB_CLEAR(&drq->rb_node);
-			drq->request = rq;
-			rq->elevator_private = drq;
-		}
+	if (drq) {
+		mempool_free(drq, dd->drq_pool);
+		rq->elevator_private = NULL;
 	}
+}
+
+static int
+deadline_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+	struct deadline_data *dd = q->elevator.elevator_data;
+	struct deadline_rq *drq;
 
-	if (ret)
-		deadline_exit(q, e);
+	drq = mempool_alloc(dd->drq_pool, gfp_mask);
+	if (drq) {
+		RB_CLEAR(&drq->rb_node);
+		drq->request = rq;
 
-	return ret;
+		INIT_LIST_HEAD(&drq->hash);
+		drq->hash_valid_count = 0;
+
+		INIT_LIST_HEAD(&drq->fifo);
+
+		rq->elevator_private = drq;
+		return 0;
+	}
+
+	return 1;
 }
 
 /*
@@ -916,6 +914,8 @@
 	.elevator_queue_empty_fn =	deadline_queue_empty,
 	.elevator_former_req_fn =	deadline_former_request,
 	.elevator_latter_req_fn =	deadline_latter_request,
+	.elevator_set_req_fn =		deadline_set_request,
+	.elevator_put_req_fn = 		deadline_put_request,
 	.elevator_init_fn =		deadline_init,
 	.elevator_exit_fn =		deadline_exit,
 
===== drivers/block/elevator.c 1.40 vs edited =====
--- 1.40/drivers/block/elevator.c	Sun Feb 16 12:32:35 2003
+++ edited/drivers/block/elevator.c	Tue Mar 25 11:04:01 2003
@@ -408,6 +408,24 @@
 	return NULL;
 }
 
+int elv_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+	elevator_t *e = &q->elevator;
+
+	if (e->elevator_set_req_fn)
+		return e->elevator_set_req_fn(q, rq, gfp_mask);
+
+	return 0;
+}
+
+void elv_put_request(request_queue_t *q, struct request *rq)
+{
+	elevator_t *e = &q->elevator;
+
+	if (e->elevator_put_req_fn)
+		e->elevator_put_req_fn(q, rq);
+}
+
 int elv_register_queue(struct gendisk *disk)
 {
 	request_queue_t *q = disk->queue;
===== drivers/block/ll_rw_blk.c 1.161 vs edited =====
--- 1.161/drivers/block/ll_rw_blk.c	Mon Mar 24 02:56:03 2003
+++ edited/drivers/block/ll_rw_blk.c	Tue Mar 25 12:20:16 2003
@@ -48,12 +48,6 @@
  */
 static int queue_nr_requests;
 
-/*
- * How many free requests must be available before we wake a process which
- * is waiting for a request?
- */
-static int batch_requests;
-
 unsigned long blk_max_low_pfn, blk_max_pfn;
 int blk_nohighio = 0;
 
@@ -1118,26 +1112,6 @@
 	spin_unlock_irq(&blk_plug_lock);
 }
 
-static int __blk_cleanup_queue(struct request_list *list)
-{
-	struct list_head *head = &list->free;
-	struct request *rq;
-	int i = 0;
-
-	while (!list_empty(head)) {
-		rq = list_entry(head->next, struct request, queuelist);
-		list_del_init(&rq->queuelist);
-		kmem_cache_free(request_cachep, rq);
-		i++;
-	}
-
-	if (i != list->count)
-		printk("request list leak!\n");
-
-	list->count = 0;
-	return i;
-}
-
 /**
  * blk_cleanup_queue: - release a &request_queue_t when it is no longer needed
  * @q:    the request queue to be released
@@ -1154,18 +1128,14 @@
  **/
 void blk_cleanup_queue(request_queue_t * q)
 {
-	int count = (queue_nr_requests*2);
+	struct request_list *rl = &q->rq;
 
 	elevator_exit(q);
 
-	count -= __blk_cleanup_queue(&q->rq[READ]);
-	count -= __blk_cleanup_queue(&q->rq[WRITE]);
-
 	del_timer_sync(&q->unplug_timer);
 	flush_scheduled_work();
 
-	if (count)
-		printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+	mempool_destroy(rl->rq_pool);
 
 	if (blk_queue_tagged(q))
 		blk_queue_free_tags(q);
@@ -1175,42 +1145,17 @@
 
 static int blk_init_free_list(request_queue_t *q)
 {
-	struct request_list *rl;
-	struct request *rq;
-	int i;
+	struct request_list *rl = &q->rq;
 
-	INIT_LIST_HEAD(&q->rq[READ].free);
-	INIT_LIST_HEAD(&q->rq[WRITE].free);
-	q->rq[READ].count = 0;
-	q->rq[WRITE].count = 0;
+	rl->count[READ] = BLKDEV_MAX_RQ;
+	rl->count[WRITE] = BLKDEV_MAX_RQ;
 
-	/*
-	 * Divide requests in half between read and write
-	 */
-	rl = &q->rq[READ];
-	for (i = 0; i < (queue_nr_requests*2); i++) {
-		rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
-		if (!rq)
-			goto nomem;
-
-		/*
-		 * half way through, switch to WRITE list
-		 */
-		if (i == queue_nr_requests)
-			rl = &q->rq[WRITE];
+	rl->rq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, request_cachep);
 
-		memset(rq, 0, sizeof(struct request));
-		rq->rq_status = RQ_INACTIVE;
-		list_add(&rq->queuelist, &rl->free);
-		rl->count++;
-	}
+	if (!rl->rq_pool)
+		return -ENOMEM;
 
-	init_waitqueue_head(&q->rq[READ].wait);
-	init_waitqueue_head(&q->rq[WRITE].wait);
 	return 0;
-nomem:
-	blk_cleanup_queue(q);
-	return 1;
 }
 
 static int __make_request(request_queue_t *, struct bio *);
@@ -1277,34 +1222,62 @@
 	return 0;
 }
 
+static inline void blk_free_request(request_queue_t *q, struct request *rq)
+{
+	elv_put_request(q, rq);
+	mempool_free(rq, q->rq.rq_pool);
+}
+
+static inline struct request *blk_alloc_request(request_queue_t *q,int gfp_mask)
+{
+	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+
+	if (rq) {
+		if (!elv_set_request(q, rq, gfp_mask))
+			return rq;
+
+		mempool_free(rq, q->rq.rq_pool);
+	}
+
+	return NULL;
+}
+
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
 /*
- * Get a free request. queue lock must be held and interrupts
- * disabled on the way in.
+ * Get a free request, queue_lock must not be held
  */
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = q->rq + rw;
+	struct request_list *rl = &q->rq;
 
-	if (!list_empty(&rl->free)) {
-		rq = blkdev_free_rq(&rl->free);
-		list_del_init(&rq->queuelist);
-		rq->ref_count = 1;
-		rl->count--;
-		if (rl->count < queue_congestion_on_threshold())
+	/*
+	 * known racy, ok though
+	 */
+	if (!rl->count[rw])
+		return NULL;
+
+	rq = blk_alloc_request(q, gfp_mask);
+	if (rq) {
+		spin_lock_irq(q->queue_lock);
+		rl->count[rw]--;
+		if (rl->count[rw] < queue_congestion_on_threshold())
 			set_queue_congested(q, rw);
+		spin_unlock_irq(q->queue_lock);
+
+		INIT_LIST_HEAD(&rq->queuelist);
 		rq->flags = 0;
-		rq->rq_status = RQ_ACTIVE;
 		rq->errors = 0;
-		rq->special = NULL;
-		rq->buffer = NULL;
-		rq->data = NULL;
-		rq->sense = NULL;
-		rq->waiting = NULL;
+		rq->rq_status = RQ_ACTIVE;
 		rq->bio = rq->biotail = NULL;
+		rq->buffer = NULL;
+		rq->ref_count = 1;
 		rq->q = q;
 		rq->rl = rl;
+		rq->special = NULL;
+		rq->data = NULL;
+		rq->waiting = NULL;
+		rq->sense = NULL;
 	}
 
 	return rq;
@@ -1316,7 +1289,7 @@
 static struct request *get_request_wait(request_queue_t *q, int rw)
 {
 	DEFINE_WAIT(wait);
-	struct request_list *rl = &q->rq[rw];
+	struct request_list *rl = &q->rq;
 	struct request *rq;
 
 	spin_lock_prefetch(q->queue_lock);
@@ -1325,64 +1298,24 @@
 	do {
 		int block = 0;
 
-		prepare_to_wait_exclusive(&rl->wait, &wait,
-					TASK_UNINTERRUPTIBLE);
 		spin_lock_irq(q->queue_lock);
-		if (!rl->count)
+		if (!rl->count[rw])
 			block = 1;
 		spin_unlock_irq(q->queue_lock);
 
 		if (block)
 			io_schedule();
-		finish_wait(&rl->wait, &wait);
 
-		spin_lock_irq(q->queue_lock);
-		rq = get_request(q, rw);
-		spin_unlock_irq(q->queue_lock);
+		rq = get_request(q, rw, GFP_NOIO);
 	} while (rq == NULL);
 	return rq;
 }
 
 struct request *blk_get_request(request_queue_t *q, int rw, int gfp_mask)
 {
-	struct request *rq;
-
-	BUG_ON(rw != READ && rw != WRITE);
-
-	spin_lock_irq(q->queue_lock);
-	rq = get_request(q, rw);
-	spin_unlock_irq(q->queue_lock);
-
-	if (!rq && (gfp_mask & __GFP_WAIT))
-		rq = get_request_wait(q, rw);
-
-	if (rq) {
-		rq->flags = 0;
-		rq->buffer = NULL;
-		rq->bio = rq->biotail = NULL;
-		rq->waiting = NULL;
-	}
-	return rq;
-}
-
-/*
- * Non-locking blk_get_request variant, for special requests from drivers.
- */
-struct request *__blk_get_request(request_queue_t *q, int rw)
-{
-	struct request *rq;
-
 	BUG_ON(rw != READ && rw != WRITE);
 
-	rq = get_request(q, rw);
-
-	if (rq) {
-		rq->flags = 0;
-		rq->buffer = NULL;
-		rq->bio = rq->biotail = NULL;
-		rq->waiting = NULL;
-	}
-	return rq;
+	return get_request(q, rw, gfp_mask);
 }
 
 /**
@@ -1499,6 +1432,9 @@
 	disk->stamp_idle = now;
 }
 
+/*
+ * queue lock must be held
+ */
 void __blk_put_request(request_queue_t *q, struct request *req)
 {
 	struct request_list *rl = req->rl;
@@ -1508,34 +1444,26 @@
 	if (unlikely(!q))
 		return;
 
-	req->rq_status = RQ_INACTIVE;
-	req->q = NULL;
-	req->rl = NULL;
-
 	/*
 	 * Request may not have originated from ll_rw_blk. if not,
 	 * it didn't come out of our reserved rq pools
 	 */
 	if (rl) {
-		int rw = 0;
+		int rw = rq_data_dir(req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 
-		list_add(&req->queuelist, &rl->free);
+		blk_free_request(q, req);
 
-		if (rl == &q->rq[WRITE])
-			rw = WRITE;
-		else if (rl == &q->rq[READ])
-			rw = READ;
-		else
-			BUG();
-
-		rl->count++;
-		if (rl->count >= queue_congestion_off_threshold())
+		rl->count[rw]++;
+		if (rl->count[rw] >= queue_congestion_off_threshold())
 			clear_queue_congested(q, rw);
-		if (rl->count >= batch_requests && waitqueue_active(&rl->wait))
-			wake_up(&rl->wait);
 	}
+
+	req->rq_status = RQ_INACTIVE;
+	req->q = NULL;
+	req->rl = NULL;
+	req->elevator_private = NULL;
 }
 
 void blk_put_request(struct request *req)
@@ -1694,9 +1622,9 @@
 
 	barrier = test_bit(BIO_RW_BARRIER, &bio->bi_rw);
 
-	spin_lock_irq(q->queue_lock);
 again:
 	insert_here = NULL;
+	spin_lock_irq(q->queue_lock);
 
 	if (blk_queue_empty(q)) {
 		blk_plug_device(q);
@@ -1769,12 +1697,12 @@
 	 * a free slot.
 	 */
 get_rq:
+	spin_unlock_irq(q->queue_lock);
+
 	if (freereq) {
 		req = freereq;
 		freereq = NULL;
-	} else if ((req = get_request(q, rw)) == NULL) {
-		spin_unlock_irq(q->queue_lock);
-
+	} else if ((req = get_request(q, rw, GFP_NOIO)) == NULL) {
 		/*
 		 * READA bit set
 		 */
@@ -1782,7 +1710,6 @@
 			goto end_io;
 
 		freereq = get_request_wait(q, rw);
-		spin_lock_irq(q->queue_lock);
 		goto again;
 	}
 
@@ -1809,19 +1736,20 @@
 	req->bio = req->biotail = bio;
 	req->rq_disk = bio->bi_bdev->bd_disk;
 	req->start_time = jiffies;
+
+	spin_lock_irq(q->queue_lock);
 	add_request(q, req, insert_here);
 out:
 	if (freereq)
 		__blk_put_request(q, freereq);
 
 	if (blk_queue_plugged(q)) {
-		int nr_queued = (queue_nr_requests - q->rq[0].count) +
-				(queue_nr_requests - q->rq[1].count);
+		int nr_queued = (queue_nr_requests - q->rq.count[0]) +
+				(queue_nr_requests - q->rq.count[1]);
 		if (nr_queued == q->unplug_thresh)
 			__generic_unplug_device(q);
 	}
 	spin_unlock_irq(q->queue_lock);
-
 	return 0;
 
 end_io:
@@ -2183,7 +2111,6 @@
 
 int __init blk_dev_init(void)
 {
-	int total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
 	int i;
 
 	request_cachep = kmem_cache_create("blkdev_requests",
@@ -2191,24 +2118,11 @@
 	if (!request_cachep)
 		panic("Can't create request pool slab cache\n");
 
-	/*
-	 * Free request slots per queue.  One per quarter-megabyte.
-	 * We use this many requests for reads, and this many for writes.
-	 */
-	queue_nr_requests = (total_ram >> 9) & ~7;
-	if (queue_nr_requests < 16)
-		queue_nr_requests = 16;
-	if (queue_nr_requests > 128)
-		queue_nr_requests = 128;
-
-	batch_requests = queue_nr_requests / 8;
-	if (batch_requests > 8)
-		batch_requests = 8;
+	queue_nr_requests = BLKDEV_MAX_RQ;
 
 	printk("block request queues:\n");
-	printk(" %d requests per read queue\n", queue_nr_requests);
-	printk(" %d requests per write queue\n", queue_nr_requests);
-	printk(" %d requests per batch\n", batch_requests);
+	printk(" %d/%d requests per read queue\n", BLKDEV_MIN_RQ, queue_nr_requests);
+	printk(" %d/%d requests per write queue\n", BLKDEV_MIN_RQ, queue_nr_requests);
 	printk(" enter congestion at %d\n", queue_congestion_on_threshold());
 	printk(" exit congestion at %d\n", queue_congestion_off_threshold());
 
@@ -2250,7 +2164,6 @@
 EXPORT_SYMBOL(blk_phys_contig_segment);
 EXPORT_SYMBOL(blk_hw_contig_segment);
 EXPORT_SYMBOL(blk_get_request);
-EXPORT_SYMBOL(__blk_get_request);
 EXPORT_SYMBOL(blk_put_request);
 EXPORT_SYMBOL(blk_insert_request);
 
===== include/linux/blkdev.h 1.98 vs edited =====
--- 1.98/include/linux/blkdev.h	Tue Feb 18 11:29:00 2003
+++ edited/include/linux/blkdev.h	Tue Mar 25 12:12:20 2003
@@ -10,6 +10,7 @@
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
 #include <linux/wait.h>
+#include <linux/mempool.h>
 
 #include <asm/scatterlist.h>
 
@@ -18,10 +19,12 @@
 struct elevator_s;
 typedef struct elevator_s elevator_t;
 
+#define BLKDEV_MIN_RQ	4
+#define BLKDEV_MAX_RQ	256
+
 struct request_list {
-	unsigned int count;
-	struct list_head free;
-	wait_queue_head_t wait;
+	int count[2];
+	mempool_t *rq_pool;
 };
 
 /*
@@ -180,7 +183,7 @@
 	/*
 	 * the queue request freelist, one for reads and one for writes
 	 */
-	struct request_list	rq[2];
+	struct request_list	rq;
 
 	request_fn_proc		*request_fn;
 	merge_request_fn	*back_merge_fn;
@@ -330,7 +333,6 @@
 extern void blk_attempt_remerge(request_queue_t *, struct request *);
 extern void __blk_attempt_remerge(request_queue_t *, struct request *);
 extern struct request *blk_get_request(request_queue_t *, int, int);
-extern struct request *__blk_get_request(request_queue_t *, int);
 extern void blk_put_request(struct request *);
 extern void blk_insert_request(request_queue_t *, struct request *, int, void *);
 extern void blk_plug_device(request_queue_t *);
===== include/linux/elevator.h 1.18 vs edited =====
--- 1.18/include/linux/elevator.h	Sun Jan 12 15:10:40 2003
+++ edited/include/linux/elevator.h	Tue Mar 25 11:03:42 2003
@@ -15,6 +15,8 @@
 typedef void (elevator_remove_req_fn) (request_queue_t *, struct request *);
 typedef struct request *(elevator_request_list_fn) (request_queue_t *, struct request *);
 typedef struct list_head *(elevator_get_sort_head_fn) (request_queue_t *, struct request *);
+typedef int (elevator_set_req_fn) (request_queue_t *, struct request *, int);
+typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
 
 typedef int (elevator_init_fn) (request_queue_t *, elevator_t *);
 typedef void (elevator_exit_fn) (request_queue_t *, elevator_t *);
@@ -34,6 +36,9 @@
 	elevator_request_list_fn *elevator_former_req_fn;
 	elevator_request_list_fn *elevator_latter_req_fn;
 
+	elevator_set_req_fn *elevator_set_req_fn;
+	elevator_put_req_fn *elevator_put_req_fn;
+
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 
@@ -58,6 +63,8 @@
 extern struct request *elv_latter_request(request_queue_t *, struct request *);
 extern int elv_register_queue(struct gendisk *);
 extern void elv_unregister_queue(struct gendisk *);
+extern int elv_set_request(request_queue_t *, struct request *, int);
+extern void elv_put_request(request_queue_t *, struct request *);
 
 #define __elv_add_request_pos(q, rq, pos)	\
 	(q)->elevator.elevator_add_req_fn((q), (rq), (pos))

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-25 11:23           ` Jens Axboe
@ 2003-03-25 11:37             ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2003-03-25 11:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dougg, pbadari, linux-kernel, linux-scsi

On Tue, Mar 25 2003, Jens Axboe wrote:
> On Tue, Mar 25 2003, Jens Axboe wrote:
> > Here's a patch that makes the request allocation (and io scheduler
> > private data) dynamic, with upper and lower bounds of 4 and 256
> > respectively. The numbers are a bit random - the 4 will allow us to make
> > progress, but it might be a smidgen too low. Perhaps 8 would be good.
> > 256 is twice as much as before, but that should be alright as long as
> > the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
> > two variables.
> > 
> > We loose the old batching functionality, for now. I can resurrect that
> > if needed. It's a rough fit with the mempool, it doesn't _quite_ fit our
> > needs here. I'll probably end up doing a specialised block pool scheme
> > for this.
> > 
> > Hasn't been tested all that much, it boots though :-)
> 
> Here's a version with better lock handling. We drop the queue_lock for
> most parts of __make_request(), except the actual io scheduler calls.

That was buggy, fixing... Back later.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-25 10:56         ` Jens Axboe
  2003-03-25 11:23           ` Jens Axboe
@ 2003-03-25 11:39           ` Nick Piggin
  2003-03-25 12:01             ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2003-03-25 11:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, dougg, pbadari, linux-kernel, linux-scsi



Jens Axboe wrote:

>On Sat, Mar 22 2003, Andrew Morton wrote:
>
>>Douglas Gilbert <dougg@torque.net> wrote:
>>
>>>Andrew Morton wrote:
>>>
>>>>Douglas Gilbert <dougg@torque.net> wrote:
>>>>
>>>>
>>>>>>Slab:           464364 kB
>>>>>>
>>>>It's all in slab.
>>>>
>>>>
>>>>
>>>>>I did notice a rather large growth of nodes
>>>>>in sysfs. For 84 added scsi_debug pseudo disks the number
>>>>>of sysfs nodes went from 686 to 3347.
>>>>>
>>>>>Does anybody know what is the per node memory cost of sysfs?
>>>>>
>>>>
>>>>Let's see all of /pro/slabinfo please.
>>>>
>>>Andrew,
>>>Attachments are /proc/slabinfo pre and post:
>>>  $ modprobe scsi_debug add_host=42 num_devs=2
>>>which adds 84 pseudo disks.
>>>
>>>
>>OK, thanks.  So with 48 disks you've lost five megabytes to blkdev_requests
>>and deadline_drq objects.  With 4000 disks, you're toast.  That's enough
>>request structures to put 200 gigabytes of memory under I/O ;)
>>
>>We need to make the request structures dymanically allocated for other
>>reasons (which I cannot immediately remember) but it didn't happen.  I guess
>>we have some motivation now.
>>
>
>Here's a patch that makes the request allocation (and io scheduler
>private data) dynamic, with upper and lower bounds of 4 and 256
>respectively. The numbers are a bit random - the 4 will allow us to make
>progress, but it might be a smidgen too low. Perhaps 8 would be good.
>256 is twice as much as before, but that should be alright as long as
>the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
>two variables.
>
>We loose the old batching functionality, for now. I can resurrect that
>if needed. It's a rough fit with the mempool, it doesn't _quite_ fit our
>needs here. I'll probably end up doing a specialised block pool scheme
>for this.
>
>Hasn't been tested all that much, it boots though :-)
>
Nice Jens. Very good in theory but I haven't looked at the
code too much yet.

Would it be possible to have all queues allocate out of
the one global pool of free requests? This way you could
have a big minimum (say 128) and a big maximum
(say min(Mbytes, spindles)).

This way memory usage is decoupled from the number of
queues, and busy spindles could make use of more
available free requests.

Oh and the max value can easily be runtime tunable, right?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-25 11:39           ` Nick Piggin
@ 2003-03-25 12:01             ` Jens Axboe
  2003-03-25 12:12               ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2003-03-25 12:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, dougg, pbadari, linux-kernel, linux-scsi

On Tue, Mar 25 2003, Nick Piggin wrote:
> 
> 
> Jens Axboe wrote:
> 
> >On Sat, Mar 22 2003, Andrew Morton wrote:
> >
> >>Douglas Gilbert <dougg@torque.net> wrote:
> >>
> >>>Andrew Morton wrote:
> >>>
> >>>>Douglas Gilbert <dougg@torque.net> wrote:
> >>>>
> >>>>
> >>>>>>Slab:           464364 kB
> >>>>>>
> >>>>It's all in slab.
> >>>>
> >>>>
> >>>>
> >>>>>I did notice a rather large growth of nodes
> >>>>>in sysfs. For 84 added scsi_debug pseudo disks the number
> >>>>>of sysfs nodes went from 686 to 3347.
> >>>>>
> >>>>>Does anybody know what is the per node memory cost of sysfs?
> >>>>>
> >>>>
> >>>>Let's see all of /pro/slabinfo please.
> >>>>
> >>>Andrew,
> >>>Attachments are /proc/slabinfo pre and post:
> >>> $ modprobe scsi_debug add_host=42 num_devs=2
> >>>which adds 84 pseudo disks.
> >>>
> >>>
> >>OK, thanks.  So with 48 disks you've lost five megabytes to 
> >>blkdev_requests
> >>and deadline_drq objects.  With 4000 disks, you're toast.  That's enough
> >>request structures to put 200 gigabytes of memory under I/O ;)
> >>
> >>We need to make the request structures dymanically allocated for other
> >>reasons (which I cannot immediately remember) but it didn't happen.  I 
> >>guess
> >>we have some motivation now.
> >>
> >
> >Here's a patch that makes the request allocation (and io scheduler
> >private data) dynamic, with upper and lower bounds of 4 and 256
> >respectively. The numbers are a bit random - the 4 will allow us to make
> >progress, but it might be a smidgen too low. Perhaps 8 would be good.
> >256 is twice as much as before, but that should be alright as long as
> >the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
> >two variables.
> >
> >We loose the old batching functionality, for now. I can resurrect that
> >if needed. It's a rough fit with the mempool, it doesn't _quite_ fit our
> >needs here. I'll probably end up doing a specialised block pool scheme
> >for this.
> >
> >Hasn't been tested all that much, it boots though :-)
> >
> Nice Jens. Very good in theory but I haven't looked at the
> code too much yet.
> 
> Would it be possible to have all queues allocate out of
> the one global pool of free requests. This way you could
> have a big minimum (say 128) and a big maximum
> (say min(Mbytes, spindles).

Well not really, as far as I can see we _need_ a pool per queue. Imagine
a bio handed to raid, needs to be split to 6 different queues. But our
minimum is 4, deadlock possibility. It could probably be made to work,
however I greatly prefer a per-queue reserve.

> This way memory usage is decoupled from the number of
> queues, and busy spindles could make use of more
> available free requests.
> 
> Oh and the max value can easily be runtime tunable, right?

Sure. However, they don't really mean _anything_. Max is just some
random number to prevent one queue going nuts, and could be completely
removed if the vm works perfectly. Beyond some limit there's little
benefit to doing that, though. But MAX could be runtime tunable. Min is
basically just to make sure we don't kill ourselves, I don't see any
point in making that runtime tunable. It's not really a tunable.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-25 12:01             ` Jens Axboe
@ 2003-03-25 12:12               ` Nick Piggin
  2003-03-25 12:35                 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2003-03-25 12:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, dougg, pbadari, linux-kernel, linux-scsi



Jens Axboe wrote:

>On Tue, Mar 25 2003, Nick Piggin wrote:
>
>>
>>Nice Jens. Very good in theory but I haven't looked at the
>>code too much yet.
>>
>>Would it be possible to have all queues allocate out of
>>the one global pool of free requests. This way you could
>>have a big minimum (say 128) and a big maximum
>>(say min(Mbytes, spindles).
>>
>
>Well not really, as far as I can see we _need_ a pool per queue. Imagine
>a bio handed to raid, needs to be split to 6 different queues. But our
>minimum is 4, deadlock possibility. It could probably be made to work,
>however I greatly prefer a per-queue reserve.
>
OK yeah you are right there. In light of your comment below
I'm happy with that. I was mostly worried about queues being
restricted to a small maximum.

>
>
>>This way memory usage is decoupled from the number of
>>queues, and busy spindles could make use of more
>>available free requests.
>>
>>Oh and the max value can easily be runtime tunable, right?
>>
>
>Sure. However, they don't really mean _anything_. Max is just some
>random number to prevent one queue going nuts, and could be completely
>removed if the vm works perfectly. Beyond some limit there's little
>benefit to doing that, though. But MAX could be runtime tunable. Min is
>basically just to make sure we don't kill ourselves, I don't see any
>point in making that runtime tunable. It's not really a tunable.
>
OK, that's good then. I would like to see max removed, but perhaps
the VM isn't up to that yet. I'll be testing this when your code
solidifies!


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-25 12:12               ` Nick Piggin
@ 2003-03-25 12:35                 ` Jens Axboe
  2003-03-27  0:29                   ` Badari Pulavarty
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2003-03-25 12:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, dougg, pbadari, linux-kernel, linux-scsi

On Tue, Mar 25 2003, Nick Piggin wrote:
> Jens Axboe wrote:
> 
> >On Tue, Mar 25 2003, Nick Piggin wrote:
> >
> >>
> >>Nice Jens. Very good in theory but I haven't looked at the
> >>code too much yet.
> >>
> >>Would it be possible to have all queues allocate out of
> >>the one global pool of free requests. This way you could
> >>have a big minimum (say 128) and a big maximum
> >>(say min(Mbytes, spindles).
> >>
> >
> >Well not really, as far as I can see we _need_ a pool per queue. Imagine
> >a bio handed to raid, needs to be split to 6 different queues. But our
> >minimum is 4, deadlock possibility. It could probably be made to work,
> >however I greatly prefer a per-queue reserve.
> >
> OK yeah you are right there. In light of your comment below
> I'm happy with that. I was mostly worried about queues being
> restricted to a small maximum.

Understandable, we'll make max tunable.

> >>This way memory usage is decoupled from the number of
> >>queues, and busy spindles could make use of more
> >>available free requests.
> >>
> >>Oh and the max value can easily be runtime tunable, right?
> >>
> >
> >Sure. However, they don't really mean _anything_. Max is just some
> >random number to prevent one queue going nuts, and could be completely
> >removed if the vm works perfectly. Beyond some limit there's little
> >benefit to doing that, though. But MAX could be runtime tunable. Min is
> >basically just to make sure we don't kill ourselves, I don't see any
> >point in making that runtime tunable. It's not really a tunable.
> >
> OK thats good then. I would like to see max removed, however perhaps
> the VM isn't up to that yet. I'll be testing this when your code
> solidifies!

Only testing will tell, so yes you are very welcome to give it a shot.
Let me release a known working version first :)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-25 12:35                 ` Jens Axboe
@ 2003-03-27  0:29                   ` Badari Pulavarty
  2003-03-27  9:18                     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Badari Pulavarty @ 2003-03-27  0:29 UTC (permalink / raw)
  To: Jens Axboe, Nick Piggin; +Cc: Andrew Morton, dougg, linux-kernel, linux-scsi

On Tuesday 25 March 2003 04:35 am, Jens Axboe wrote:

> Only testing will tell, so yes you are very welcome to give it a shot.
> Let me release a known working version first :)

Jens,

 I found what's using 32MB out of the 8192-byte slab.

size-8192       before:10 after:4012 diff:4002 size:8192 incr:32784384

It is deadline_init():

        dd->hash = kmalloc(sizeof(struct list_head)*DL_HASH_ENTRIES,GFP_KERNEL);

It is creating an 8K hash table for each queue. Since we have 4000 queues,
that uses 32MB. I wonder why the current code needs 1024 hash buckets
when the maximum number of requests is only 256. And also, since you are making
request allocation dynamic, can you change this too? Any issues here?
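
For reference, the arithmetic behind the 32MB (assuming a 32-bit kernel,
where struct list_head is two pointers, i.e. 8 bytes):

	1024 buckets * 8 bytes = 8192 bytes per queue (one size-8192 object)
	4000 queues * 8192 bytes = ~32MB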

Thanks,
Badari

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-27  0:29                   ` Badari Pulavarty
@ 2003-03-27  9:18                     ` Jens Axboe
  2003-03-28 17:04                       ` Badari Pulavarty
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2003-03-27  9:18 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Nick Piggin, Andrew Morton, dougg, linux-kernel, linux-scsi

On Wed, Mar 26 2003, Badari Pulavarty wrote:
> On Tuesday 25 March 2003 04:35 am, Jens Axboe wrote:
> 
> > Only testing will tell, so yes you are very welcome to give it a shot.
> > Let me release a known working version first :)
> 
> Jens,
> 
>  I  found whats using 32MB out of 8192-byte slab.
> 
> size-8192       before:10 after:4012 diff:4002 size:8192 incr:32784384
> 
> It is deadline_init():
> 
>         dd->hash = kmalloc(sizeof(struct list_head)*DL_HASH_ENTRIES,GFP_KERNEL);
> 
> It is creating 8K hash table for each queue. Since we have 4000 queues,

Yes

> it used 32MB. I wonder why the current code needs 1024 hash buckets, 

Hmm, actually that's a leftover from when we played with bigger queue
sizes; I inadvertently forgot to change it back when pushing the rbtree
deadline update to Linus. It used to be 256. We can shrink this to 2^7
or 2^8 instead, which will then only eat 1-2K.

> when maximum requests are only 256. And also, since you are making
> request allocation dynamic, can you change this too ? Any issues here ?

No real issues with shrinking it; the bigger problem is if we move to larger
queues. With the rq-dyn-alloc patch, we can raise the max-requests ceiling
a lot higher, and then the hash needs to be bigger too.
But for now, a 256-entry hash should be a good default and suffice for the
future; I'll push that change.
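
Size math, for reference: with struct list_head at 8 bytes on a 32-bit
build, 2^8 buckets is 2K per queue and 2^7 is 1K, versus 8K for the
current 1024 buckets. The change itself is presumably a one-liner along
these lines (the exact macro layout in deadline-iosched.c may differ):

	-#define DL_HASH_ENTRIES		(1 << 10)
	+#define DL_HASH_ENTRIES		(1 << 8)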

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-27  9:18                     ` Jens Axboe
@ 2003-03-28 17:04                       ` Badari Pulavarty
  2003-03-28 18:41                         ` Andries Brouwer
  0 siblings, 1 reply; 22+ messages in thread
From: Badari Pulavarty @ 2003-03-28 17:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Nick Piggin, Andrew Morton, dougg, linux-kernel, linux-scsi

Hi All,

I found the heavy users of the 2048-byte slab. 8MB doesn't seem like much,
but it all adds up.

size-2048       before:98 after:4095 diff:3997 size:2060 incr:8233820

The problem is with

	sd.c: sd_attach()
		 alloc_disk(16)

alloc_disk() allocates "hd_struct" structures for 15 minors,
so it is an 84*15 = 1260 byte allocation. These all come from
the 2048-byte slab. Since I have 4000 simulated disks, that uses up 8MB.
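
Rough numbers: 15 * 84 = 1260 bytes per disk, which kmalloc rounds up to
a 2048-byte object, so 4000 disks * 2048 bytes = ~8MB, of which roughly
4000 * (2048 - 1260) = ~3MB is pure rounding waste.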

Proposed fixes:

1) Make the allocations come from their own slab instead of the
2048-byte slab (~40% saving); a rough sketch is below.

2) Instead of allocating hd_struct structures for all possible partitions,
why not allocate them dynamically, as we see a partition? This
way we could (in theory) support more than 16 partitions, if needed.

Are there any issues with doing (2)?
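
For (1), a minimal sketch of what I mean (illustration only, untested,
and the cache name is made up):

	static kmem_cache_t *sd_parts_cachep;

	/* one object = the 15-entry hd_struct array that alloc_disk(16) needs */
	sd_parts_cachep = kmem_cache_create("sd_parts",
				15 * sizeof(struct hd_struct), 0, 0, NULL, NULL);

	/* then, instead of the kmalloc() in alloc_disk(): */
	disk->part = kmem_cache_alloc(sd_parts_cachep, GFP_KERNEL);

In reality alloc_disk() takes an arbitrary minors count, so it would need
a bit more care than this, but it shows the idea.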

Thanks,
Badari

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-28 17:04                       ` Badari Pulavarty
@ 2003-03-28 18:41                         ` Andries Brouwer
  2003-03-29  1:39                           ` Badari Pulavarty
  0 siblings, 1 reply; 22+ messages in thread
From: Andries Brouwer @ 2003-03-28 18:41 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Jens Axboe, Nick Piggin, Andrew Morton, dougg, linux-kernel, linux-scsi

On Fri, Mar 28, 2003 at 09:04:41AM -0800, Badari Pulavarty wrote:

> 2) Instead of allocating hd_struct structure for all possible partitions,
> why not allocated them dynamically, as we see a partition ? This
> way we could (in theory) support more than 16 partitions, if needed.

This is what I plan to do.
Of course you are welcome to do it first.

Andries


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch for playing] 2.5.65 patch to support > 256 disks
  2003-03-28 18:41                         ` Andries Brouwer
@ 2003-03-29  1:39                           ` Badari Pulavarty
  0 siblings, 0 replies; 22+ messages in thread
From: Badari Pulavarty @ 2003-03-29  1:39 UTC (permalink / raw)
  To: Andries Brouwer
  Cc: Jens Axboe, Nick Piggin, Andrew Morton, dougg, linux-kernel, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 648 bytes --]

On Friday 28 March 2003 10:41 am, Andries Brouwer wrote:
> On Fri, Mar 28, 2003 at 09:04:41AM -0800, Badari Pulavarty wrote:
> > 2) Instead of allocating hd_struct structure for all possible partitions,
> > why not allocated them dynamically, as we see a partition ? This
> > way we could (in theory) support more than 16 partitions, if needed.
>
> This is what I plan to do.
> Of course you are welcome to do it first.
>
> Andries

Okay!! Here is my patch to add hd_structs dynamically as we add partitions.
The machine boots fine, and I was able to add/delete partitions.

It is not polished yet, but any comments?

Thanks,
Badari


[-- Attachment #2: dyn.part --]
[-- Type: text/x-diff, Size: 7327 bytes --]

--- linux/include/linux/genhd.h	Fri Mar 28 14:49:59 2003
+++ linux.new/include/linux/genhd.h	Fri Mar 28 18:17:55 2003
@@ -63,7 +63,7 @@ struct hd_struct {
 	devfs_handle_t de;              /* primary (master) devfs entry  */
 	struct kobject kobj;
 	unsigned reads, read_sectors, writes, write_sectors;
-	int policy;
+	int policy, partno;
 };
 
 #define GENHD_FL_REMOVABLE  1
@@ -89,7 +89,7 @@ struct gendisk {
 	int minor_shift;		/* number of times minor is shifted to
 					   get real minor */
 	char disk_name[16];		/* name of major driver */
-	struct hd_struct *part;		/* [indexed by minor] */
+	struct hd_struct **part;	/* [indexed by minor] */
 	struct block_device_operations *fops;
 	struct request_queue *queue;
 	void *private_data;
--- linux/drivers/block/genhd.c	Fri Mar 28 14:30:57 2003
+++ linux.new/drivers/block/genhd.c	Fri Mar 28 18:21:17 2003
@@ -365,11 +365,13 @@ static int show_partition(struct seq_fil
 		(unsigned long long)get_capacity(sgp) >> 1,
 		disk_name(sgp, 0, buf));
 	for (n = 0; n < sgp->minors - 1; n++) {
-		if (sgp->part[n].nr_sects == 0)
+		if (!sgp->part[n])
+			continue;
+		if (sgp->part[n]->nr_sects == 0)
 			continue;
 		seq_printf(part, "%4d  %4d %10llu %s\n",
 			sgp->major, n + 1 + sgp->first_minor,
-			(unsigned long long)sgp->part[n].nr_sects >> 1 ,
+			(unsigned long long)sgp->part[n]->nr_sects >> 1 ,
 			disk_name(sgp, n + 1, buf));
 	}
 
@@ -531,8 +533,6 @@ static decl_subsys(block,&ktype_block);
 
 struct gendisk *alloc_disk(int minors)
 {
-	int dbg = 0 ;
-
 	struct gendisk *disk = kmalloc(sizeof(struct gendisk), GFP_KERNEL);
 	if (disk) {
 		memset(disk, 0, sizeof(struct gendisk));
@@ -541,7 +541,7 @@ struct gendisk *alloc_disk(int minors)
 			return NULL;
 		}
 		if (minors > 1) {
-			int size = (minors - 1) * sizeof(struct hd_struct);
+			int size = (minors - 1) * sizeof(struct hd_struct *);
 			disk->part = kmalloc(size, GFP_KERNEL);
 			if (!disk->part) {
 				kfree(disk);
@@ -593,8 +593,8 @@ void set_device_ro(struct block_device *
 	struct gendisk *disk = bdev->bd_disk;
 	if (bdev->bd_contains != bdev) {
 		int part = bdev->bd_dev - MKDEV(disk->major, disk->first_minor);
-		struct hd_struct *p = &disk->part[part-1];
-		p->policy = flag;
+		struct hd_struct *p = disk->part[part-1];
+		if (p) p->policy = flag;
 	} else
 		disk->policy = flag;
 }
@@ -604,7 +604,7 @@ void set_disk_ro(struct gendisk *disk, i
 	int i;
 	disk->policy = flag;
 	for (i = 0; i < disk->minors - 1; i++)
-		disk->part[i].policy = flag;
+		if (disk->part[i]) disk->part[i]->policy = flag;
 }
 
 int bdev_read_only(struct block_device *bdev)
@@ -615,8 +615,9 @@ int bdev_read_only(struct block_device *
 	disk = bdev->bd_disk;
 	if (bdev->bd_contains != bdev) {
 		int part = bdev->bd_dev - MKDEV(disk->major, disk->first_minor);
-		struct hd_struct *p = &disk->part[part-1];
-		return p->policy;
+		struct hd_struct *p = disk->part[part-1];
+		if (p) return p->policy;
+		return 0;
 	} else
 		return disk->policy;
 }
--- linux/drivers/block/ioctl.c	Fri Mar 28 15:01:48 2003
+++ linux.new/drivers/block/ioctl.c	Fri Mar 28 18:16:55 2003
@@ -41,11 +41,11 @@ static int blkpg_ioctl(struct block_devi
 					return -EINVAL;
 			}
 			/* partition number in use? */
-			if (disk->part[part - 1].nr_sects != 0)
+			if (disk->part[part - 1])
 				return -EBUSY;
 			/* overlap? */
 			for (i = 0; i < disk->minors - 1; i++) {
-				struct hd_struct *s = &disk->part[i];
+				struct hd_struct *s = disk->part[i];
 				if (!(start+length <= s->start_sect ||
 				      start >= s->start_sect + s->nr_sects))
 					return -EBUSY;
@@ -54,7 +54,9 @@ static int blkpg_ioctl(struct block_devi
 			add_partition(disk, part, start, length);
 			return 0;
 		case BLKPG_DEL_PARTITION:
-			if (disk->part[part - 1].nr_sects == 0)
+			if (!disk->part[part-1])
+				return -ENXIO;
+			if (disk->part[part - 1]->nr_sects == 0)
 				return -ENXIO;
 			/* partition in use? Incomplete check for now. */
 			bdevp = bdget(MKDEV(disk->major, disk->first_minor) + part);
--- linux/drivers/block/ll_rw_blk.c	Fri Mar 28 14:58:48 2003
+++ linux.new/drivers/block/ll_rw_blk.c	Fri Mar 28 18:15:25 2003
@@ -1867,7 +1867,7 @@ static inline void blk_partition_remap(s
 	if (bdev == bdev->bd_contains)
 		return;
 
-	p = &disk->part[bdev->bd_dev-MKDEV(disk->major,disk->first_minor)-1];
+	p = disk->part[bdev->bd_dev-MKDEV(disk->major,disk->first_minor)-1];
 	switch (bio->bi_rw) {
 	case READ:
 		p->read_sectors += bio_sectors(bio);
--- linux/fs/block_dev.c	Fri Mar 28 17:36:28 2003
+++ linux.new/fs/block_dev.c	Fri Mar 28 18:15:08 2003
@@ -559,10 +559,10 @@ static int do_open(struct block_device *
 			bdev->bd_contains = whole;
 			down(&whole->bd_sem);
 			whole->bd_part_count++;
-			p = disk->part + part - 1;
+			p = disk->part[part - 1];
 			bdev->bd_inode->i_data.backing_dev_info =
 			   whole->bd_inode->i_data.backing_dev_info;
-			if (!(disk->flags & GENHD_FL_UP) || !p->nr_sects) {
+			if (!(disk->flags & GENHD_FL_UP) || !p || !p->nr_sects) {
 				whole->bd_part_count--;
 				up(&whole->bd_sem);
 				ret = -ENXIO;
--- linux/fs/partitions/check.c	Fri Mar 28 14:32:29 2003
+++ linux.new/fs/partitions/check.c	Fri Mar 28 18:22:56 2003
@@ -103,8 +103,8 @@ char *disk_name(struct gendisk *hd, int 
 		}
 		sprintf(buf, "%s", hd->disk_name);
 	} else {
-		if (hd->part[part-1].de) {
-			pos = devfs_generate_path(hd->part[part-1].de, buf, 64);
+		if (hd->part[part-1]->de) {
+			pos = devfs_generate_path(hd->part[part-1]->de, buf, 64);
 			if (pos >= 0)
 				return buf + pos;
 		}
@@ -160,7 +160,7 @@ static void devfs_register_partition(str
 {
 #ifdef CONFIG_DEVFS_FS
 	devfs_handle_t dir;
-	struct hd_struct *p = dev->part;
+	struct hd_struct *p = dev->parts;
 	char devname[16];
 
 	if (p[part-1].de)
@@ -203,7 +203,7 @@ static struct sysfs_ops part_sysfs_ops =
 static ssize_t part_dev_read(struct hd_struct * p, char *page)
 {
 	struct gendisk *disk = container_of(p->kobj.parent,struct gendisk,kobj);
-	int part = p - disk->part + 1;
+	int part = p->partno;
 	dev_t base = MKDEV(disk->major, disk->first_minor); 
 	return sprintf(page, "%04x\n", (unsigned)(base + part));
 }
@@ -255,19 +255,30 @@ static struct kobj_type ktype_part = {
 
 void delete_partition(struct gendisk *disk, int part)
 {
-	struct hd_struct *p = disk->part + part - 1;
+	struct hd_struct *p = disk->part[part-1];
+	if (!p)
+		return;
 	if (!p->nr_sects)
 		return;
+	printk("del_partition: disk:%x part:%d \n", disk, part);
 	p->start_sect = 0;
 	p->nr_sects = 0;
 	p->reads = p->writes = p->read_sectors = p->write_sectors = 0;
 	devfs_unregister(p->de);
 	kobject_unregister(&p->kobj);
+	disk->part[part-1] = NULL;
+	kfree(p);
 }
 
 void add_partition(struct gendisk *disk, int part, sector_t start, sector_t len)
 {
-	struct hd_struct *p = disk->part + part - 1;
+	struct hd_struct *p;
+	printk("add_partition: disk:%x part:%d start:%d len:%d\n", disk, part, (int)start, (int)len);
+
+	p = kmalloc(sizeof(struct hd_struct), GFP_KERNEL);
+	if (!p) return;
+
+	memset(p, 0, sizeof(struct hd_struct));
 
 	p->start_sect = start;
 	p->nr_sects = len;
@@ -276,6 +287,9 @@ void add_partition(struct gendisk *disk,
 	p->kobj.parent = &disk->kobj;
 	p->kobj.ktype = &ktype_part;
 	kobject_register(&p->kobj);
+
+	p->partno = part;
+	disk->part[part-1] = p;
 }
 
 static void disk_sysfs_symlinks(struct gendisk *disk)

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2003-03-29  1:30 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-21 18:56 [patch for playing] 2.5.65 patch to support > 256 disks Badari Pulavarty
2003-03-22 11:00 ` Douglas Gilbert
2003-03-22 11:04   ` Andrew Morton
2003-03-22 11:46     ` Douglas Gilbert
2003-03-22 12:05       ` Andrew Morton
2003-03-24 21:32         ` Badari Pulavarty
2003-03-24 22:22           ` Douglas Gilbert
2003-03-24 22:54             ` Badari Pulavarty
2003-03-25  0:10           ` Andrew Morton
2003-03-24 22:57             ` Badari Pulavarty
2003-03-25 10:56         ` Jens Axboe
2003-03-25 11:23           ` Jens Axboe
2003-03-25 11:37             ` Jens Axboe
2003-03-25 11:39           ` Nick Piggin
2003-03-25 12:01             ` Jens Axboe
2003-03-25 12:12               ` Nick Piggin
2003-03-25 12:35                 ` Jens Axboe
2003-03-27  0:29                   ` Badari Pulavarty
2003-03-27  9:18                     ` Jens Axboe
2003-03-28 17:04                       ` Badari Pulavarty
2003-03-28 18:41                         ` Andries Brouwer
2003-03-29  1:39                           ` Badari Pulavarty

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).