* Newbie device mapper questions
@ 2015-06-15 18:39 Johannes Bauer
  2015-06-15 19:52 ` Doug Dumitru
  2015-06-16 19:46 ` Alasdair G Kergon
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Bauer @ 2015-06-15 18:39 UTC (permalink / raw)
  To: device-mapper development

Hi list,

I've had this idea stuck in my head for a while and am finally no longer
too intimidated by the dm API to give it a shot and play around. I'm
just getting used to the DM internals, so please excuse me if I sound
like an idiot; I'm new to DM.

Something that I would like to implement first is a device mapper target
that takes three block devices as input: Two equally sized devices (src1
and src2) and a separate metadata device (meta). I want to map chunks of
the src devices to bits of a bitmap in the meta device. Whether the bit
is set in the meta device decides whether src1 or src2 is returned.
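
A minimal userspace sketch of the lookup described above, assuming a
fixed chunk size in 512-byte sectors and an in-memory copy of the bitmap.
The names chunk_of and pick_source, the bit order, and the "bit set means
src2" convention are illustrative assumptions, not part of any existing
target.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the chunk that contains a given 512-byte sector. */
static uint64_t chunk_of(uint64_t sector, uint32_t chunk_sectors)
{
	assert(chunk_sectors != 0);
	return sector / chunk_sectors;
}

/*
 * Return 1 for src1 or 2 for src2, depending on the chunk's bit.
 * Assumed convention: bit clear selects src1, bit set selects src2,
 * least significant bit of each byte first.
 */
static int pick_source(const uint8_t *bitmap, uint64_t sector,
		       uint32_t chunk_sectors)
{
	uint64_t chunk = chunk_of(sector, chunk_sectors);
	int bit = (bitmap[chunk / 8] >> (chunk % 8)) & 1;

	return bit ? 2 : 1;
}
```

In a real target the bitmap would have to be read from the meta device
first, which is what the questions below are about.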

Sounds pretty easy and I also got surprisingly far with my little kernel
module. I've so far implemented ctr, dtr, map and status.

In map() I actually do the switching operation. I've looked at how
dm-linear implements this and copied a lot from it. Currently I
do static switching (fixed block size, ignoring the meta device). Here
are my questions:

- How can I read within the kernel from the block device lc->meta->bdev?
If I call "read_dev_sector" from "map" this results in a deadlock; I'm
guessing this is not how it's supposed to work. The bcache module must
perform something similar (because it also reads and writes metadata,
only in a much more complex way), but for the life of me I couldn't find
where the actual reading/writing is performed in the code. What are the
things that I should look at?

- Is the ctr callback the appropriate place to fail if a logical error
occurs? For example, if two src devices of dissimilar size are passed to
dmsetup?

- Is i_size_read(lc->src1dev->bdev->bd_inode) the correct way of
determining the size of the underlying block device? If not, which
function is?

- Can I safely assume the logical sector size is fixed to be 512 bytes
in all cases?

- In the dm-linear example, bio_sectors(bio) is checked. This gives, if
I understand it correctly, the size in sectors of the BIO (usually this
is 8). What I don't understand is in which cases this can become zero
(dm-linear has an if that checks for bio_sectors(bio) != 0).

- Can I already determine in ctr() the size the bio will have in map()?
Can I assume it will never change once it has been determined?
The reason is that for my example I need to make sure the chunk size is
an integer multiple of the bio size and I would only like to check this
once (in ctr) and not every time (in map).

Thank you very much for helping out a complete newbie :-)
Best regards,
Johannes


* Re: Newbie device mapper questions
  2015-06-15 18:39 Newbie device mapper questions Johannes Bauer
@ 2015-06-15 19:52 ` Doug Dumitru
  2015-06-15 22:20   ` Vivek Goyal
  2015-06-16 18:54   ` Johannes Bauer
  2015-06-16 19:46 ` Alasdair G Kergon
  1 sibling, 2 replies; 9+ messages in thread
From: Doug Dumitru @ 2015-06-15 19:52 UTC (permalink / raw)
  To: device-mapper development


On Mon, Jun 15, 2015 at 11:39 AM, Johannes Bauer <dfnsonfsduifb@gmx.de>
wrote:

> Hi list,
>
> I've had this idea stuck in my head for a while and am finally no longer
> too intimidated by the dm API to give it a shot and play around. I'm
> just getting used to the DM internals, so please excuse me if I sound
> like an idiot; I'm new to DM.
>
> Something that I would like to implement first is a device mapper target
> that takes three block devices as input: Two equally sized devices (src1
> and src2) and a separate metadata device (meta). I want to map chunks of
> the src devices to bits of a bitmap in the meta device. Whether the bit
> is set in the meta device decides whether src1 or src2 is returned.
>
> Sounds pretty easy and I also got surprisingly far with my little kernel
> module. I've so far implemented ctr, dtr, map and status.
>

Congratulations, you are actually a long way there.


>
> In map() I actually do the switching operation. I've looked at how
> dm-linear implements this and copied a lot from it. Currently I
> do static switching (fixed block size, ignoring the meta device). Here
> are my questions:
>
> - How can I read within the kernel from the block device lc->meta->bdev?
> If I call "read_dev_sector" from "map" this results in a deadlock; I'm
> guessing this is not how it's supposed to work. The bcache module must
> perform something similar (because it also reads and writes metadata,
> only in a much more complex way), but for the life of me I couldn't find
> where the actual reading/writing is performed in the code. What are the
> things that I should look at?
>

You have to allocate a bio, populate it, allocate pages for the buffer,
populate the bvec, and call make_request (or generic_make_request).  You
will get the completion from the bio in the bottom half of the interrupt
handler, so how much work you can do there is debatable.  You cannot start
a new IO from there, which you need to.  You will probably want to start a
helper thread and have the completion routine schedule itself onto your
thread.  Once you are back on your thread, you can do just about anything.

Because you need to do IO, you will not be able to do a simple bio "bounce
redirect".  You will need to do the IO yourself (i.e., call another make
request), but you can use the caller's bvec for this, so there is no data
copy required.  Once the request completes, you can then finish the
caller's request.


>
> - Is the ctr callback the appropriate place to fail if a logical error
> occurs? For example, if two src devices of dissimilar size are passed to
> dmsetup?
>

If you cannot continue because devices are not present or not the right
size, yes, you should fail the ctr routine.

If you want to set up /proc or other monitoring stuff, you can use the init
routine, probably plus some statics, to set up "views" into your module.  If
you want to support multiple instances (and you should), set up a
/proc/{yourname} directory in the init routine and then populate it with
sub-directories every time you create a device.


>
> - Is i_size_read(lc->src1dev->bdev->bd_inode) the correct way of
> determining the size of the underlying block device? If not, which
> function is?
>

... I am happy to leave out answers that I don't know ...


>
> - Can I safely assume the logical sector size is fixed to be 512 bytes
> in all cases?
>

Probably not, but maybe.  You are in control of the hardware.


>
> - In the dm-linear example, bio_sectors(bio) is checked. This gives, if
> I understand it correctly, the size in sectors of the BIO (usually this
> is 8). What I don't understand is in which cases this can become zero
> (dm-linear has an if that checks for bio_sectors(bio) != 0).
>

... just a sanity check.  If you get a call of zero size, it means
something else is broken.


>
> - Can I already determine in ctr() the size the bio will have in map()?
> Can I assume it will never change once it has been determined?
> The reason is that for my example I need to make sure the chunk size is
> an integer multiple of the bio size and I would only like to check this
> once (in ctr) and not every time (in map).
>

Block size will not change.  The size of requests to you is limited by the
setup of ti->max_io_len.  If you don't set this with recent kernels, you
will only get 4K, which is not all that efficient.  This is actually part
of another big topic of "stacked limits", which someone could write a book
on (and I would read it).
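
To illustrate the limit mentioned above: as far as I understand it, DM
core splits incoming I/O so that no piece handed to map() crosses a
multiple of ti->max_io_len, counted from the start of the target. On
kernels of this era a target would typically request that with
dm_set_target_max_io_len() in its ctr. The userspace model below only
mimics that arithmetic; next_piece_len and count_pieces are hypothetical
helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Length in sectors of the next piece of a request that starts at
 * `offset` (relative to the target start) with `remaining` sectors
 * left, such that the piece never crosses a max_io_len boundary.
 */
static uint64_t next_piece_len(uint64_t offset, uint64_t remaining,
			       uint64_t max_io_len)
{
	uint64_t to_boundary;

	assert(max_io_len != 0);
	to_boundary = max_io_len - (offset % max_io_len);
	return remaining < to_boundary ? remaining : to_boundary;
}

/* Count how many pieces a request is cut into under that rule. */
static unsigned int count_pieces(uint64_t offset, uint64_t len,
				 uint64_t max_io_len)
{
	unsigned int pieces = 0;

	while (len > 0) {
		uint64_t n = next_piece_len(offset, len, max_io_len);

		offset += n;
		len -= n;
		pieces++;
	}
	return pieces;
}
```

Under this model, with max_io_len at 8 sectors (4K), a 1 MiB request
(2048 sectors) turns into 256 pieces; a larger max_io_len reduces that.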


>
> Thank you very much for helping out a complete newbie :-)
> Best regards,
> Johannes
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>



-- 
Doug Dumitru
EasyCo LLC


* Re: Newbie device mapper questions
  2015-06-15 19:52 ` Doug Dumitru
@ 2015-06-15 22:20   ` Vivek Goyal
  2015-06-16  0:55     ` Minfei Huang
  2015-06-16 18:54   ` Johannes Bauer
  1 sibling, 1 reply; 9+ messages in thread
From: Vivek Goyal @ 2015-06-15 22:20 UTC (permalink / raw)
  To: doug, device-mapper development

On Mon, Jun 15, 2015 at 12:52:57PM -0700, Doug Dumitru wrote:

[..]
> >
> > In map() I actually do the switching operation. I've looked at how
> > dm-linear implements this and copied a lot from it. Currently I
> > do static switching (fixed block size, ignoring the meta device). Here
> > are my questions:
> >
> > - How can I read within the kernel from the block device lc->meta->bdev?
> > If I call "read_dev_sector" from "map" this results in a deadlock; I'm
> > guessing this is not how it's supposed to work. The bcache module must
> > perform something similar (because it also reads and writes metadata,
> > only in a much more complex way), but for the life of me I couldn't find
> > where the actual reading/writing is performed in the code. What are the
> > things that I should look at?
> >
>
> You have to allocate a bio, populate it, allocate pages for the buffer,
> populate the bvec, and call make_request (or generic_make_request).  You
> will get the completion from the bio in the bottom half of the interrupt
> handler, so how much work you can do there is debatable.  You cannot start
> a new IO from there, which you need to.  You will probably want to start a
> helper thread and have the completion routine schedule itself onto your
> thread.  Once you are back on your thread, you can do just about anything.
>
> Because you need to do IO, you will not be able to do a simple bio "bounce
> redirect".  You will need to do the IO yourself (i.e., call another make
> request), but you can use the caller's bvec for this, so there is no data
> copy required.  Once the request completes, you can then finish the
> caller's request.

Above sounds right.


[..]
> >
> > - Is i_size_read(lc->src1dev->bdev->bd_inode) the correct way of
> > determining the size of the underlying block device? If not, which
> > function is?
> >

I believe that's correct. Look at block/blk-core.c

handle_bad_sector(struct bio *bio) 
{
}

>
> ... I am happy to leave out answers that I don't know ...
> 
> 
> >
> > - Can I safely assume the logical sector size is fixed to be 512 bytes
> > in all cases?
> >
>
> Probably not, but maybe.  You are in control of the hardware.

Do you mean block size or sector size? I think sector sizes can vary, and
targets can allow user space to specify one. Typically metadata is kept
per block, which can lead to a smaller metadata footprint.

> 
> 
> >
> > - In the dm-linear example, bio_sectors(bio) is checked. This gives, if
> > I understand it correctly, the size in sectors of the BIO (usually this
> > is 8). What I don't understand is in which cases this can become zero
> > (dm-linear has an if that checks for bio_sectors(bio) != 0).
> >
>
> ... just a sanity check.  If you get a call of zero size, it means
> something else is broken.
> 
> 
> >
> > - Can I already determine in ctr() the size the bio will have in map()?
> > Can I assume it will never change once it has been determined?
> > The reason is that for my example I need to make sure the chunk size is
> > an integer multiple of the bio size and I would only like to check this
> > once (in ctr) and not every time (in map).
> >

I think the size of a bio can change; it depends on the submitter. Targets
can specify a maximum bio size and upper layers will adhere to it. So
I don't think you can fix the size of the bio in the constructor. I guess
you will have to check each incoming bio in map().
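
A per-bio check of the kind suggested here could look roughly like this.
It is a userspace sketch with plain integers standing in for the bio's
start sector and bio_sectors(bio); bio_fits_in_chunk is a hypothetical
name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a request of `nr_sectors` starting at `sector` lies entirely
 * inside one chunk of `chunk_sectors` sectors. Zero-length requests
 * trivially fit.
 */
static bool bio_fits_in_chunk(uint64_t sector, uint32_t nr_sectors,
			      uint32_t chunk_sectors)
{
	uint64_t first, last;

	assert(chunk_sectors != 0);
	if (nr_sectors == 0)
		return true;

	first = sector / chunk_sectors;
	last = (sector + nr_sectors - 1) / chunk_sectors;
	return first == last;
}
```

The comparison is cheap enough to run on every call to map(), which
avoids relying on any assumption about bio sizes made in ctr().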

Thanks
Vivek


* Re: Newbie device mapper questions
  2015-06-15 22:20   ` Vivek Goyal
@ 2015-06-16  0:55     ` Minfei Huang
  2015-06-16  1:18       ` Minfei Huang
  0 siblings, 1 reply; 9+ messages in thread
From: Minfei Huang @ 2015-06-16  0:55 UTC (permalink / raw)
  To: device-mapper development; +Cc: doug

On Tue, Jun 16, 2015 at 6:20 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Jun 15, 2015 at 12:52:57PM -0700, Doug Dumitru wrote:
>> >
>> > - Is i_size_read(lc->src1dev->bdev->bd_inode) the correct way of
>> > determining the size of the underlying block device? If not, which
>> > function is?
>> >
>
> I believe that's correct. Look at block/blk-core.c
>
> handle_bad_sector(struct bio *bio)
> {
> }
>

The kernel calls i_size_write to record the device capacity when the
device is registered (register_disk). During registration, blkdev_get
calls bd_set_size to set the capacity.

So yes, calling i_size_read is a correct way to determine the device's
capacity.
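
Since i_size_read() returns a size in bytes, a ctr-time size check would
usually convert to 512-byte sectors first. A minimal userspace sketch of
that arithmetic follows; validate_sizes, chunks_for and the error codes
are assumptions for illustration, and the meta device is assumed to hold
one bit per chunk.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* the kernel counts sectors in 512-byte units */

static uint64_t bytes_to_sectors(uint64_t bytes)
{
	return bytes >> SECTOR_SHIFT;
}

/* Number of chunks needed to cover `sectors`, rounding up. */
static uint64_t chunks_for(uint64_t sectors, uint32_t chunk_sectors)
{
	assert(chunk_sectors != 0);
	return (sectors + chunk_sectors - 1) / chunk_sectors;
}

/*
 * Return 0 if the three device sizes (in bytes) are usable, or a
 * negative value naming the first problem found.
 */
static int validate_sizes(uint64_t src1_bytes, uint64_t src2_bytes,
			  uint64_t meta_bytes, uint32_t chunk_sectors)
{
	uint64_t chunks, bitmap_bytes;

	if (chunk_sectors == 0)
		return -1;	/* invalid chunk size */
	if (src1_bytes != src2_bytes)
		return -2;	/* sources differ in size */

	chunks = chunks_for(bytes_to_sectors(src1_bytes), chunk_sectors);
	bitmap_bytes = (chunks + 7) / 8;	/* one bit per chunk */

	if (meta_bytes < bitmap_bytes)
		return -3;	/* meta device too small for the bitmap */
	return 0;
}
```

A nonzero result here corresponds to the case where the ctr routine
should fail.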

>>
>> ... I am happy to leave out answers that I don't know ...
>>
>>
>> >
>> > - Can I safely assume the logical sector size is fixed to be 512 bytes
>> > in all cases?
>> >
>>
>> Probably not, but maybe.  You are in control of the hardware.
>
> You mean block size or sector size? I think sector sizes can vary and
> targets can allow user space to specify one. Typically metadata is per
> block can lead to smaller metadata foot print.
>

For now, the minimum alignment unit is 512 bytes. But as far as I know, a
lot of devices are starting to support a 4KB alignment unit.

Please correct me if anything in this comment is wrong.

Thanks
Minfei


* Re: Newbie device mapper questions
  2015-06-16  0:55     ` Minfei Huang
@ 2015-06-16  1:18       ` Minfei Huang
  0 siblings, 0 replies; 9+ messages in thread
From: Minfei Huang @ 2015-06-16  1:18 UTC (permalink / raw)
  To: dfnsonfsduifb; +Cc: device-mapper development

On 06/16/15 at 08:55P, Minfei Huang wrote:
> On Tue, Jun 16, 2015 at 6:20 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Mon, Jun 15, 2015 at 12:52:57PM -0700, Doug Dumitru wrote:
> >> >
> >> > - Is i_size_read(lc->src1dev->bdev->bd_inode) the correct way of
> >> > determining the size of the underlying block device? If not, which
> >> > function is?
> >> >
> >
> > I believe that's correct. Look at block/blk-core.c
> >
> > handle_bad_sector(struct bio *bio)
> > {
> > }
> >
> 
> The kernel calls i_size_write to record the device capacity when the
> device is registered (register_disk). During registration, blkdev_get
> calls bd_set_size to set the capacity.
>
> So yes, calling i_size_read is a correct way to determine the device's
> capacity.
> 
> >>
> >> ... I am happy to leave out answers that I don't know ...
> >>
> >>
> >> >
> >> > - Can I safely assume the logical sector size is fixed to be 512 bytes
> >> > in all cases?
> >> >
> >>
> >> Probably not, but maybe.  You are in control of the hardware.
> >
> > You mean block size or sector size? I think sector sizes can vary and
> > targets can allow user space to specify one. Typically metadata is per
> > block can lead to smaller metadata foot print.
> >
> 
> For now, the minimum alignment unit is 512 bytes. But as far as I know, a
> lot of devices are starting to support a 4KB alignment unit.

Correction:

Although the device supports physical sectors of up to 4KB, it divides
each physical sector into 8 pieces of 512 bytes each, called logical
sectors.

$ sudo hdparm -I /dev/sda | egrep -i "physical|logical|device size with M"
Logical		max	current
Logical  Sector size:                   512 bytes
Physical Sector size:                  4096 bytes
Logical Sector-0 offset:                  0 bytes
device size with M = 1024*1024:      953869 MBytes
device size with M = 1000*1000:     1000204 MBytes (1000 GB)

So I think you can use the logical sector size (512 bytes) to align the
start position and count.
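
One way to express that alignment rule, bearing in mind that the kernel
counts sectors in 512-byte units regardless of the device's logical block
size (in-kernel, bdev_logical_block_size() reports that size in bytes).
The helper below is a hypothetical userspace sketch, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Check that a request's start and length, both given in 512-byte
 * sectors, are multiples of the logical block size given in bytes.
 */
static bool lbs_aligned(uint64_t start, uint64_t count,
			uint32_t logical_block_size)
{
	uint32_t sectors_per_block;

	/* logical block sizes are a multiple of 512 (commonly 512 or 4096) */
	assert(logical_block_size >= 512 && logical_block_size % 512 == 0);
	sectors_per_block = logical_block_size / 512;

	return start % sectors_per_block == 0 &&
	       count % sectors_per_block == 0;
}
```

With a 512-byte logical block size every sector-granular request passes,
which matches the hdparm output above.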

Thanks
Minfei

> 
> Please correct me if there is something wrong in the comment.
> 
> Thanks
> Minfei


* Re: Newbie device mapper questions
  2015-06-15 19:52 ` Doug Dumitru
  2015-06-15 22:20   ` Vivek Goyal
@ 2015-06-16 18:54   ` Johannes Bauer
  2015-06-16 19:37     ` Alasdair G Kergon
  2015-06-16 21:05     ` Doug Dumitru
  1 sibling, 2 replies; 9+ messages in thread
From: Johannes Bauer @ 2015-06-16 18:54 UTC (permalink / raw)
  To: dm-devel, doug

On 15.06.2015 21:52, Doug Dumitru wrote:

>> Sounds pretty easy and I also got surprisingly far with my little kernel
>> module. I've so far implemented ctr, dtr, map and status.
>
> Congratulations, you are actually a long way there.

Thanks but I think I have the mountain still ahead -- still, I would
really like to figure out the nitty-gritty.

> You have to allocate a bio, populate it, allocate pages for the buffer,
> populate the bvec, and call make_request (or generic_make_request).  You
> will get the completion from the bio in the bottom half of the interrupt
> handler, so how much work you can do there is debatable.  You cannot start
> a new IO from there, which you need to.  You will probably want to start a
> helper thread and have the completion routine schedule itself onto your
> thread.  Once you are back on your thread, you can do just about anything.
>
> Because you need to do IO, you will not be able to do a simple bio "bounce
> redirect".  You will need to do the IO yourself (i.e., call another make
> request), but you can use the caller's bvec for this, so there is no data
> copy required.  Once the request completes, you can then finish the
> caller's request.

Oh, wow. This sounds truly terrifying. Let's dive in!

I tried to read your hints one word at a time. So here's the somewhat
pseudocodish solution to my homework:

struct bio *b = bio_alloc(GFP_NOIO, 1);
b->bi_size = 8;
bio_alloc_pages(b, GFP_NOIO);
b->bi_sector = 1234;
b->bi_bdev = lc->metadev->bdev;
b->bi_rw = READ;
b->bi_private = local_ctx;
b->bi_end_io = read_complete_callback;
generic_make_request(b);

static void read_complete_callback(struct bio *b, int error) {
  // ???
  printk(KERN_INFO "First read byte: %02x\n",
     b->bi_io_vec[0]->bv_page[0]);
}

So I hope this is even remotely close to what I should end up with.

This will alloc a new bio with, as I understand it, one page buffer in
b->bi_io_vec. This buffer is then allocated with bio_alloc_pages to 8
sectors in size (i.e. exactly one page of 4096 bytes). Then the read
address, block device and read mode is set. I pass some kind of local
context so I can do something meaningful in the callback and specify the
callback function. Then I execute the request.

As I understand, this executes asynchronously. So here comes the
threading into play, right? Just pseudocode (because I can't judge how
far I'm off here), but let's say this is map():

void read_complete_callback() {
    semaphore_inc(local_ctx);
}

void map() {
   local_ctx->semaphore->value = 0;

   // Issue read as above
   generic_make_request(bi);

   semaphore_dec(&local_ctx->semaphore);

   // Now the concurrent async IO has finished and we interpret the data
   [...]
}

Oh boy I really don't know if this is even remotely close. Any hints, as
easy as they may seem to you guys, are really greatly appreciated. I've
never worked with this stuff.

> If you cannot continue because devices are not present or not the right
> size, yes, you should fail the ctr routine.

Alright!

> If you want to set up /proc or other monitoring stuff, you can use the init
> routine, probably plus some statics, to set up "views" into your module.  If
> you want to support multiple instances (and you should), set up a
> /proc/{yourname} directory in the init routine and then populate it with
> sub-directories every time you create a device.

Okay, I'll try to do this (want to make statistics available via procfs
later on), but one construction site at a time for me.

>> - Can I already determine in ctr() the size the bio will have in map()?
>> Can I assume it will never change once it has been determined?
>> The reason is that for my example I need to make sure the chunk size is
>> an integer multiple of the bio size and I would only like to check this
>> once (in ctr) and not every time (in map).
>
> Block size will not change.  The size of requests to you is limited by the
> setup of ti->max_io_len.  If you don't set this with recent kernels, you
> will only get 4K, which is not all that efficient.  This is actually part
> of another big topic of "stacked limits", which someone could write a book
> on (and I would read it).

So if I would want to do a large I/O operation (say write one megabyte
of data to a block device somewhere within my driver) I'd have to make
lots of calls to generic_make_request?

Thank you so much for your help,
Best regards,
Johannes


* Re: Newbie device mapper questions
  2015-06-16 18:54   ` Johannes Bauer
@ 2015-06-16 19:37     ` Alasdair G Kergon
  2015-06-16 21:05     ` Doug Dumitru
  1 sibling, 0 replies; 9+ messages in thread
From: Alasdair G Kergon @ 2015-06-16 19:37 UTC (permalink / raw)
  To: Johannes Bauer; +Cc: dm-devel, doug

See whether you can use dm-bufio for what you need.

Alasdair


* Re: Newbie device mapper questions
  2015-06-15 18:39 Newbie device mapper questions Johannes Bauer
  2015-06-15 19:52 ` Doug Dumitru
@ 2015-06-16 19:46 ` Alasdair G Kergon
  1 sibling, 0 replies; 9+ messages in thread
From: Alasdair G Kergon @ 2015-06-16 19:46 UTC (permalink / raw)
  To: Johannes Bauer; +Cc: device-mapper development

On Mon, Jun 15, 2015 at 08:39:54PM +0200, Johannes Bauer wrote:
> Something that I would like to implement first is a device mapper target
> that takes three block devices as input: Two equally sized devices (src1
> and src2) and a separate metadata device (meta). I want to map chunks of
> the src devices to bits of a bitmap in the meta device. Whether the bit
> is set in the meta device decides whether src1 or src2 is returned.
 
The dm-switch target already provides this functionality under the control of
userspace.  Userspace can maintain/monitor your metadata device and send
messages to the kernel whenever there are changes.
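
As a sketch of the userspace side, the helper below turns a bitmap into a
message string in the set_region_mappings format. As far as I recall from
the dm-switch documentation, that message takes <index>:<path_nr> pairs in
hexadecimal, with the index optional after the first entry; treat the
exact syntax as an assumption to verify against
Documentation/device-mapper/switch.txt. build_switch_msg is a hypothetical
helper, and mapping bit 0 to path 0 (src1) and bit 1 to path 1 (src2) is
an assumed convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format regions [0, nr_regions) of `bitmap` as a set_region_mappings
 * message: bit clear -> path 0 (src1), bit set -> path 1 (src2).
 * Returns the string length, or -1 if `buf` is too small.
 */
static int build_switch_msg(const uint8_t *bitmap, unsigned int nr_regions,
			    char *buf, size_t len)
{
	int n = snprintf(buf, len, "set_region_mappings");
	size_t used;
	unsigned int i;

	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (i = 0; i < nr_regions; i++) {
		unsigned int path = (bitmap[i / 8] >> (i % 8)) & 1;

		/* only the first entry carries an explicit (hex) index */
		if (i == 0)
			n = snprintf(buf + used, len - used, " 0:%x", path);
		else
			n = snprintf(buf + used, len - used, " :%x", path);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

The resulting string would then be handed to the kernel with something
like "dmsetup message <name> 0 <string>".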

Alasdair


* Re: Newbie device mapper questions
  2015-06-16 18:54   ` Johannes Bauer
  2015-06-16 19:37     ` Alasdair G Kergon
@ 2015-06-16 21:05     ` Doug Dumitru
  1 sibling, 0 replies; 9+ messages in thread
From: Doug Dumitru @ 2015-06-16 21:05 UTC (permalink / raw)
  To: Johannes Bauer; +Cc: device-mapper development


Johannes,

I was not trying to scare you, just tell you the rough path.  I have been
exactly where you are.

On Tue, Jun 16, 2015 at 11:54 AM, Johannes Bauer <dfnsonfsduifb@gmx.de>
wrote:

> On 15.06.2015 21:52, Doug Dumitru wrote:
>
> >> Sounds pretty easy and I also got surprisingly far with my little kernel
> >> module. I've so far implemented ctr, dtr, map and status.
> >
> > Congratulations, you are actually a long way there.
>
> Thanks but I think I have the mountain still ahead -- still, I would
> really like to figure out the nitty-gritty.
>
> > You have to allocate a bio, populate it, allocate pages for the buffer,
> > populate the bvec, and call make_request (or generic_make_request).  You
> > will get the completion from the bio in the bottom half of the interrupt
> > handler, so how much work you can do there is debatable.  You cannot
> > start a new IO from there, which you need to.  You will probably want to
> > start a helper thread and have the completion routine schedule itself
> > onto your thread.  Once you are back on your thread, you can do just
> > about anything.
> >
> > Because you need to do IO, you will not be able to do a simple bio
> > "bounce redirect".  You will need to do the IO yourself (i.e., call
> > another make request), but you can use the caller's bvec for this, so
> > there is no data copy required.  Once the request completes, you can
> > then finish the caller's request.
>
> Oh, wow. This sounds truly terrifying. Let's dive in!
>
> I tried to read your hints one word at a time. So here's the somewhat
> pseudocodish solution to my homework:
>
> struct bio *b = bio_alloc(GFP_NOIO, 1);
> b->bi_size = 8;
> bio_alloc_pages(b, GFP_NOIO);
> b->bi_sector = 1234;
> b->bi_bdev = lc->metadev->bdev;
> b->bi_rw = READ;
> b->bi_private = local_ctx;
> b->bi_end_io = read_complete_callback;
> generic_make_request(b);
>

bi_size is in bytes.

The biovec count is in pages.

You will need to allocate local_ctx (it cannot be on the stack).  You
probably need to allocate a structure in the .ctr routine that is your
"device context".  Each "operation" then gets its own allocation that
points back to the device context.


>
> static void read_complete_callback(struct bio *b, int error) {
>   // ???
>   printk(KERN_INFO "First read byte: %02x\n",
>      b->bi_io_vec[0]->bv_page[0]);
> }
>

Here you usually do:

Q_WORK *q = b->bi_private;
DEV *dev = q->dev;

to get your context back.
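
The pattern described above, sketched in userspace C with malloc()
standing in for a kernel allocator and a plain void pointer standing in
for bio->bi_private. struct dev_ctx, struct op_ctx, start_read and
read_complete are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-device state, allocated once (in the .ctr routine in a real target). */
struct dev_ctx {
	uint32_t chunk_sectors;
	unsigned int completed;	/* number of finished operations */
};

/* Per-operation state; heap-allocated so it outlives the submitter's stack. */
struct op_ctx {
	struct dev_ctx *dev;	/* back pointer to the device context */
	uint64_t sector;
};

/* Allocate an operation context; the result plays the role of bi_private. */
static void *start_read(struct dev_ctx *dev, uint64_t sector)
{
	struct op_ctx *op = malloc(sizeof(*op));

	if (!op)
		return NULL;
	op->dev = dev;
	op->sector = sector;
	return op;
}

/* Completion side: recover both contexts from the private pointer. */
static uint64_t read_complete(void *private)
{
	struct op_ctx *op = private;
	struct dev_ctx *dev = op->dev;
	uint64_t sector = op->sector;

	dev->completed++;
	free(op);
	return sector;
}
```

Because each operation owns its own allocation, several reads can be in
flight at once without sharing state other than the device context.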


>
> So I hope this is even remotely close to what I should end up with.
>
> This will alloc a new bio with, as I understand it, one page buffer in
> b->bi_io_vec. This buffer is then allocated with bio_alloc_pages to 8
> sectors in size (i.e. exactly one page of 4096 bytes). Then the read
> address, block device and read mode is set. I pass some kind of local
> context so I can do something meaningful in the callback and specify the
> callback function. Then I execute the request.
>
> As I understand, this executes asynchronously. So here comes the
> threading into play, right? Just pseudocode (because I can't judge how
> far I'm off here), but let's say this is map():
>
> void read_complete_callback() {
>     semaphore_inc(local_ctx);
> }
>
> void map() {
>    local_ctx->semaphore->value = 0;
>
>    // Issue read as above
>    generic_make_request(bi);
>
>    semaphore_dec(&local_ctx->semaphore);
>
>    // Now the concurrent async IO has finished and we interpret the data
>    [...]
> }
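>

It is more like this (really pseudo code):

static int thread_helper(void *thread_param)
{
  DEV *dev = thread_param;
  Q_WORK *q;
  unsigned long flags;

  while ( 1 ) {
    if ( dev->shutdown_flg ) break;
    spin_lock_irqsave(&dev->workqueue_slock, flags);
    if ( dev->workqueue_head ) {
      /* pop the first item while holding the lock */
      q = dev->workqueue_head;
      dev->workqueue_head = q->nxt;
      spin_unlock_irqrestore(&dev->workqueue_slock, flags);
      /* ... process q work ... */
      continue;
    }
    spin_unlock_irqrestore(&dev->workqueue_slock, flags);
    /* nothing queued: sleep until someone calls up() */
    down(&dev->workqueue_sem);
  }
  return 0;
}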

You have to start the thread and set up a semaphore and spinlock.  A wait
queue is better, but semaphores do work.

When you want to schedule work in the background, you add your new "q" item
to the head/tail singly linked list.  A doubly linked list is fine and
easier to program, but overkill.
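
The head/tail singly linked list mentioned above can be sketched as
follows. This is a single-threaded userspace model: the spinlock and the
wake-up of the helper thread are deliberately left out, and struct
work_queue, wq_push and wq_pop are hypothetical names. In-kernel, the
stock workqueue API (INIT_WORK(), queue_work()) is a common alternative to
a hand-rolled queue.

```c
#include <assert.h>
#include <stddef.h>

struct work_item {
	int id;
	struct work_item *next;
};

struct work_queue {
	struct work_item *head;	/* next item to process */
	struct work_item *tail;	/* last item queued */
};

/* Append an item at the tail (what the completion routine would do). */
static void wq_push(struct work_queue *wq, struct work_item *item)
{
	item->next = NULL;
	if (wq->tail)
		wq->tail->next = item;
	else
		wq->head = item;
	wq->tail = item;
}

/* Remove and return the head item, or NULL if the queue is empty. */
static struct work_item *wq_pop(struct work_queue *wq)
{
	struct work_item *item = wq->head;

	if (!item)
		return NULL;
	wq->head = item->next;
	if (!wq->head)
		wq->tail = NULL;	/* queue drained: reset the tail too */
	return item;
}
```

Items come out in the order they were queued, so completions are handled
first in, first out.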
    ​


>
> Oh boy I really don't know if this is even remotely close. Any hints, as
> easy as they may seem to you guys, are really greatly appreciated. I've
> never worked with this stuff.
>

Start by creating a thread in module load and destroying it in module
unload.  You can use statics as the DEV.  You should use atomics as "thread
counters", so when the thread starts, it increments the "running thread
count".  When a thread exits, it decrements the counter.  This way, the
module unload routine can set "dev->shutdown_flg", do a bunch of
up(...) calls to wake up the threads, and then wait for the threads to exit
by watching the counter.  Throw in some sleeps to keep the loop that waits
for exits from killing the box.

If you do it correctly, you can start a bunch of copies of the worker
thread.  If you are after a lot of bandwidth or IOPS, this might be
helpful.  Otherwise, you can probably get away with just one.  Having just
one helper is nice because you don't have to set as many locks to protect
yourself from yourself.

Once you have your first live thread, you can build a queue to give it work
to do.  Once you have work you can give it, you are off to the races.


>
> > If you cannot continue because devices are not present or not the
> > right size, yes, you should fail the ctr routine.
>
> Alright!
>
> > If you want to set up /proc or other monitoring stuff, you can use the
> > init routine, probably plus some statics, to set up "views" into your
> > module.  If you want to support multiple instances (and you should),
> > set up a /proc/{yourname} directory in the init routine and then populate
> > it with sub-directories every time you create a device.
>
> Okay, I'll try to do this (want to make statistics available via procfs
> later on), but one construction site at a time for me.
>
> >> - Can I already determine in ctr() the size the bio will have in map()?
> >> Can I assume it will never change once it has been determined?
> >> The reason is that for my example I need to make sure the chunk size is
> >> an integer multiple of the bio size and I would only like to check this
> >> once (in ctr) and not every time (in map).
> >
> > Block size will not change.  The size of requests to you is limited by
> > the setup of ti->max_io_len.  If you don't set this with recent kernels,
> > you will only get 4K, which is not all that efficient.  This is actually
> > part of another big topic of "stacked limits", which someone could write
> > a book on (and I would read it).
>
> So if I would want to do a large I/O operation (say write one megabyte
> of data to a block device somewhere within my driver) I'd have to make
> lots of calls to generic_make_request?
>
> Thank you so much for your help,
> Best regards,
> Johannes
>



-- 
Doug Dumitru
EasyCo LLC
