From: Andrew Morton <akpm@linux-foundation.org>
To: Nick Piggin <npiggin@suse.de>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-fsdevel@vger.kernel.org,
Christian Borntraeger <borntraeger@de.ibm.com>,
"Eric W. Biederman" <ebiederm@xmission.com>,
rob@landley.net, Jens Axboe <axboe@kernel.dk>
Subject: Re: [patch] rewrite rd
Date: Mon, 3 Dec 2007 22:29:03 -0800 [thread overview]
Message-ID: <20071203222903.c820707b.akpm@linux-foundation.org> (raw)
In-Reply-To: <20071204042628.GA26636@wotan.suse.de>
On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin <npiggin@suse.de> wrote:
> Hi,
>
> This is my proposal for a (hopefully) backwards compatible rd driver.
> The motivation for me is not pressing, except that I have this code
> sitting here that is either going to rot or get merged. I'm happy to
> make myself maintainer of this code, but if any real block device
> driver writer would like to, that would be fine by me ;)
>
> Comments?
>
> --
>
> This is a rewrite of the ramdisk block device driver.
>
> The old one is really difficult because it effectively implements a block
> device which serves data out of its own buffer cache. It relies on the dirty
> bit being set, to pin its backing store in cache, however there are non
> trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
> which had recently lead to data corruption. And in general it is completely
> wrong for a block device driver to do this.
>
> The new one is more like a regular block device driver. It has no idea
> about vm/vfs stuff. It's backing store is similar to the buffer cache
> (a simple radix-tree of pages), but it doesn't know anything about page
> cache (the pages in the radix tree are not pagecache pages).
>
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.
what about mmap(/dev/ram0)?
> The fact that it now goes through all the regular vm/fs paths makes it
> much more useful for testing, too.
>
> text data bss dec hex filename
> 2837 849 384 4070 fe6 drivers/block/rd.o
> 3528 371 12 3911 f47 drivers/block/brd.o
>
> Text is larger, but data and bss are smaller, making total size smaller.
>
> A few other nice things about it:
> - Similar structure and layout to the new loop device handlinag.
> - Dynamic ramdisk creation.
> - Runtime flexible buffer head size (because it is no longer part of the
> ramdisk code).
> - Boot / load time flexible ramdisk size, which could easily be extended
> to a per-ramdisk runtime changeable size (eg. with an ioctl).
This ramdisk driver can use highmem whereas the existing one can't (yes?).
That's a pretty major difference. Plus look at the revoltingness in rd.c's
mapping_set_gfp_mask()s.
> +++ linux-2.6/drivers/block/brd.c
> @@ -0,0 +1,536 @@
> +/*
> + * Ram backed block device driver.
> + *
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.
> + *
> + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
> + * of their respective owners.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/moduleparam.h>
> +#include <linux/major.h>
> +#include <linux/blkdev.h>
> +#include <linux/bio.h>
> +#include <linux/highmem.h>
> +#include <linux/gfp.h>
> +#include <linux/radix-tree.h>
> +#include <linux/buffer_head.h> /* invalidate_bh_lrus() */
> +
> +#include <asm/uaccess.h>
> +
> +#define SECTOR_SHIFT 9
That's our third definition of SECTOR_SHIFT.
> +#define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
> +#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT)
> +
> +/*
> + * Each block ramdisk device has a radix_tree brd_pages of pages that stores
> + * the pages containing the block device's contents. A brd page's ->index is
> + * its offset in PAGE_SIZE units. This is similar to, but in no way connected
> + * with, the kernel's pagecache or buffer cache (which sit above our block
> + * device).
> + */
> +struct brd_device {
> + int brd_number;
> + int brd_refcnt;
> + loff_t brd_offset;
> + loff_t brd_sizelimit;
> + unsigned brd_blocksize;
> +
> + struct request_queue *brd_queue;
> + struct gendisk *brd_disk;
> + struct list_head brd_list;
> +
> + /*
> + * Backing store of pages and lock to protect it. This is the contents
> + * of the block device.
> + */
> + spinlock_t brd_lock;
> + struct radix_tree_root brd_pages;
> +};
> +
> +/*
> + * Look up and return a brd's page for a given sector.
> + */
> +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
> +{
> + unsigned long idx;
Could use pgoff_t here if you think that's clearer.
> + struct page *page;
> +
> + /*
> + * The page lifetime is protected by the fact that we have opened the
> + * device node -- brd pages will never be deleted under us, so we
> + * don't need any further locking or refcounting.
> + *
> + * This is strictly true for the radix-tree nodes as well (ie. we
> + * don't actually need the rcu_read_lock()), however that is not a
> + * documented feature of the radix-tree API so it is better to be
> + * safe here (we don't have total exclusion from radix tree updates
> + * here, only deletes).
> + */
> + rcu_read_lock();
> + idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
> + page = radix_tree_lookup(&brd->brd_pages, idx);
> + rcu_read_unlock();
> +
> + BUG_ON(page && page->index != idx);
> +
> + return page;
> +}
> +
> +/*
> + * Look up and return a brd's page for a given sector.
> + * If one does not exist, allocate an empty page, and insert that. Then
> + * return it.
> + */
> +static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
> +{
> + unsigned long idx;
> + struct page *page;
> +
> + page = brd_lookup_page(brd, sector);
> + if (page)
> + return page;
> +
> + page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
Why GFP_NOIO?
Have you thought about __GFP_MOVABLE treatment here? Seems pretty
desirable as this sucker can be large.
> + if (!page)
> + return NULL;
> +
> + if (radix_tree_preload(GFP_NOIO)) {
> + __free_page(page);
> + return NULL;
> + }
> +
> + spin_lock(&brd->brd_lock);
> + idx = sector >> PAGE_SECTORS_SHIFT;
> + if (radix_tree_insert(&brd->brd_pages, idx, page)) {
> + __free_page(page);
> + page = radix_tree_lookup(&brd->brd_pages, idx);
> + BUG_ON(!page);
> + BUG_ON(page->index != idx);
> + } else
> + page->index = idx;
> + spin_unlock(&brd->brd_lock);
> +
> + radix_tree_preload_end();
> +
> + return page;
> +}
> +
> +/*
> + * Free all backing store pages and radix tree. This must only be called when
> + * there are no other users of the device.
> + */
> +#define FREE_BATCH 16
> +static void brd_free_pages(struct brd_device *brd)
> +{
> + unsigned long pos = 0;
> + struct page *pages[FREE_BATCH];
> + int nr_pages;
> +
> + do {
> + int i;
> +
> + nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
> + (void **)pages, pos, FREE_BATCH);
> +
> + for (i = 0; i < nr_pages; i++) {
> + void *ret;
> +
> + BUG_ON(pages[i]->index < pos);
> + pos = pages[i]->index;
> + ret = radix_tree_delete(&brd->brd_pages, pos);
> + BUG_ON(!ret || ret != pages[i]);
> + __free_page(pages[i]);
> + }
> +
> + pos++;
> +
> + } while (nr_pages == FREE_BATCH);
> +}
I have vague memories that radix_tree_gang_lookup()'s "naive"
implementation may return fewer items than you asked for even when there
are more items remaining - when it hits certain boundaries.
> +/*
> + * copy_to_brd_setup must be called before copy_to_brd. It may sleep.
> + */
> +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n)
> +{
> + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> + size_t copy;
> +
> + copy = min((unsigned long)n, PAGE_SIZE - offset);
use min_t. Or, better, sort out the types.
> + if (!brd_insert_page(brd, sector))
> + return -ENOMEM;
> + if (copy < n) {
> + sector += copy >> SECTOR_SHIFT;
> + if (!brd_insert_page(brd, sector))
> + return -ENOMEM;
> + }
> + return 0;
> +}
> +
> +/*
> + * Copy n bytes from src to the brd starting at sector. Does not sleep.
> + */
> +static void copy_to_brd(struct brd_device *brd, const void *src,
> + sector_t sector, size_t n)
> +{
> + struct page *page;
> + void *dst;
> + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> + size_t copy;
> +
> + copy = min((unsigned long)n, PAGE_SIZE - offset);
Ditto.
> + page = brd_lookup_page(brd, sector);
> + BUG_ON(!page);
> +
> + dst = kmap_atomic(page, KM_USER1);
> + memcpy(dst + offset, src, copy);
> + kunmap_atomic(dst, KM_USER1);
Might need flush_dcache_page() if mmap gets sorted out.
> + if (copy < n) {
> + src += copy;
> + sector += copy >> SECTOR_SHIFT;
> + copy = n - copy;
> + page = brd_lookup_page(brd, sector);
> + BUG_ON(!page);
> +
> + dst = kmap_atomic(page, KM_USER1);
> + memcpy(dst, src, copy);
> + kunmap_atomic(dst, KM_USER1);
> + }
> +}
> +
> +/*
> + * Copy n bytes to dst from the brd starting at sector. Does not sleep.
> + */
> +static void copy_from_brd(void *dst, struct brd_device *brd,
> + sector_t sector, size_t n)
> +{
> + struct page *page;
> + const void *src;
> + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> + size_t copy;
> +
> + copy = min((unsigned long)n, PAGE_SIZE - offset);
tritto.
> + page = brd_lookup_page(brd, sector);
> + if (page) {
> + src = kmap_atomic(page, KM_USER1);
> + memcpy(dst, src + offset, copy);
> + kunmap_atomic(src, KM_USER1);
> + } else
> + memset(dst, 0, copy);
> +
> + if (copy < n) {
> + dst += copy;
> + sector += copy >> SECTOR_SHIFT;
> + copy = n - copy;
> + page = brd_lookup_page(brd, sector);
> + if (page) {
> + src = kmap_atomic(page, KM_USER1);
> + memcpy(dst, src, copy);
> + kunmap_atomic(src, KM_USER1);
> + } else
> + memset(dst, 0, copy);
> + }
> +}
> +
> +/*
> + * Process a single bvec of a bio.
> + */
> +static int brd_do_bvec(struct brd_device *brd, struct page *page,
> + unsigned int len, unsigned int off, int rw,
> + sector_t sector)
> +{
> + void *mem;
> + int err = 0;
> +
> + if (rw != READ) {
> + err = copy_to_brd_setup(brd, sector, len);
> + if (err)
> + goto out;
> + }
> +
> + mem = kmap_atomic(page, KM_USER0);
> + if (rw == READ) {
> + copy_from_brd(mem + off, brd, sector, len);
> + flush_dcache_page(page);
hm, there's a flush_dcache_page(). I guess you've throught it through ;)
> + } else
> + copy_to_brd(brd, mem + off, sector, len);
> + kunmap_atomic(mem, KM_USER0);
> +
> +out:
> + return err;
> +}
> +
>
> ...
>
> +static int brd_ioctl(struct inode *inode, struct file *file,
> + unsigned int cmd, unsigned long arg)
> +{
> + int error;
> + struct block_device *bdev = inode->i_bdev;
> + struct brd_device *brd = bdev->bd_disk->private_data;
> +
> + if (cmd != BLKFLSBUF)
> + return -ENOTTY;
> +
> + /*
> + * ram device BLKFLSBUF has special semantics, we want to actually
> + * release and destroy the ramdisk data.
> + */
> + mutex_lock(&bdev->bd_mutex);
> + error = -EBUSY;
> + if (bdev->bd_openers <= 1) {
> + /*
> + * Invalidate the cache first, so it isn't written
> + * back to the device.
> + */
> + invalidate_bh_lrus();
> + truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
hm, some other thread can instantiate pagecache here. I guess it's always
been like that and there's not a lot we can (or should) do about it.
> + brd_free_pages(brd);
> + error = 0;
> + }
> + mutex_unlock(&bdev->bd_mutex);
> +
> + return error;
> +}
> +
>
> ...
>
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -1436,6 +1436,7 @@ void invalidate_bh_lrus(void)
> {
> on_each_cpu(invalidate_bh_lru, NULL, 1, 1);
> }
> +EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
Maybe create a new helper function which does
invalidate_bh_lrus()+truncate_inode_pages(), call that from kill_bdev() and
here, make invalidate_bh_lrus() static.
That's a separate patch, I guess.
next prev parent reply other threads:[~2007-12-04 6:30 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-12-04 4:26 [patch] rewrite rd Nick Piggin
2007-12-04 6:29 ` Andrew Morton [this message]
2007-12-04 7:01 ` Nick Piggin
2007-12-04 7:08 ` Nick Piggin
2007-12-04 7:55 ` Rob Landley
2007-12-04 9:29 ` Nick Piggin
2007-12-04 19:53 ` Rob Landley
2007-12-04 9:54 ` Christian Borntraeger
2007-12-04 10:10 ` Nick Piggin
2007-12-04 11:21 ` [patch] rd: support XIP Nick Piggin
2007-12-04 11:23 ` [patch] ext2: xip check fix Nick Piggin
2007-12-05 15:43 ` Carsten Otte
2007-12-05 23:33 ` Nick Piggin
2007-12-06 8:43 ` Carsten Otte
2007-12-06 8:52 ` Nick Piggin
2007-12-06 9:59 ` Carsten Otte
2007-12-06 10:18 ` Nick Piggin
2007-12-06 10:24 ` Carsten Otte
2007-12-06 18:11 ` Rob Landley
2007-12-07 3:22 ` Jared Hulbert
2007-12-07 4:17 ` Rob Landley
2007-12-07 4:23 ` Nick Piggin
2007-12-07 4:40 ` Jared Hulbert
2007-12-07 8:59 ` Carsten Otte
2007-12-07 9:52 ` Jared Hulbert
2007-12-04 11:26 ` [patch] rd: support XIP Andrew Morton
2007-12-04 11:35 ` Nick Piggin
2007-12-04 13:00 ` [patch] mm: fix XIP file writes Nick Piggin
2007-12-10 14:38 ` Christian Borntraeger
2007-12-12 4:03 ` Nick Piggin
2007-12-04 12:06 ` [patch] rd: support XIP Duane Griffin
2007-12-04 13:03 ` [patch] rd: support XIP (updated) Nick Piggin
2008-01-14 16:47 ` [patch] rewrite rd Matthew Wilcox
2008-01-14 17:21 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20071203222903.c820707b.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=borntraeger@de.ibm.com \
--cc=ebiederm@xmission.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=npiggin@suse.de \
--cc=rob@landley.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).