aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md/raid0.c
AgeCommit message (Collapse)Author
2020-06-25md: add feature flag MD_FEATURE_RAID0_LAYOUTNeilBrown
[ Upstream commit 33f2c35a54dfd75ad0e7e86918dcbe4de799a56c ] Due to a bug introduced in Linux 3.14 we cannot determine the correctly layout for a multi-zone RAID0 array - there are two possibilities. It is possible to tell the kernel which to chose using a module parameter, but this can be clumsy to use. It would be best if the choice were recorded in the metadata. So add a feature flag for this purpose. If it is set, then the 'layout' field of the superblock is used to determine which layout to use. If this flag is not set, then mddev->layout gets set to -1, which causes the module parameter to be required. Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-23block: fix an integer overflow in logical block sizeMikulas Patocka
commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream. Logical block size has type unsigned short. That means that it can be at most 32768. However, there are architectures that can run with 64k pages (for example arm64) and on these architectures, it may be possible to create block devices with 64k block size. For exmaple (run this on an architecture with 64k pages): Mount will fail with this error because it tries to read the superblock using 2-sector access: device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536 EXT4-fs (dm-0): unable to read superblock This patch changes the logical block size from unsigned short to unsigned int to avoid the overflow. Cc: stable@vger.kernel.org Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17md: improve handling of bio with REQ_PREFLUSH in md_flush_request()David Jeffery
commit 775d78319f1ceb32be8eb3b1202ccdc60e9cb7f1 upstream. If pers->make_request fails in md_flush_request(), the bio is lost. To fix this, pass back a bool to indicate if the original make_request call should continue to handle the I/O and instead of assuming the flush logic will push it to completion. Convert md_flush_request to return a bool and no longer calls the raid driver's make_request function. If the return is true, then the md flush logic has or will complete the bio and the md make_request call is done. If false, then the md make_request function needs to keep processing like it is a normal bio. Let the original call to md_handle_request handle any need to retry sending the bio to the raid driver's make_request function should it be needed. Also mark md_flush_request and the make_request function pointer as __must_check to issue warnings should these critical return values be ignored. Fixes: 2bc13b83e629 ("md: batch flush requests.") Cc: stable@vger.kernel.org # # v4.19+ Cc: NeilBrown <neilb@suse.com> Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-13md/raid0: Fix an error message in raid0_make_request()Dan Carpenter
[ Upstream commit e3fc3f3d0943b126f76b8533960e4168412d9e5a ] The first argument to WARN() is supposed to be a condition. The original code will just print the mdname() instead of the full warning message. Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-29md/raid0: fix warning message for parameter default_layoutSong Liu
[ Upstream commit 3874d73e06c9b9dc15de0b7382fc223986d75571 ] The message should match the parameter, i.e. raid0.default_layout. Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Cc: NeilBrown <neilb@suse.de> Reported-by: Ivan Topolsky <doktor.yak@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05md/raid0: avoid RAID0 data corruption due to layout confusion.NeilBrown
[ Upstream commit c84a1372df929033cb1a0441fb57bd3932f39ac9 ] If the drives in a RAID0 are not all the same size, the array is divided into zones. The first zone covers all drives, to the size of the smallest. The second zone covers all drives larger than the smallest, up to the size of the second smallest - etc. A change in Linux 3.14 unintentionally changed the layout for the second and subsequent zones. All the correct data is still stored, but each chunk may be assigned to a different device than in pre-3.14 kernels. This can lead to data corruption. It is not possible to determine what layout to use - it depends which kernel the data was written by. So we add a module parameter to allow the old (0) or new (1) layout to be specified, and refused to assemble an affected array if that parameter is not set. Fixes: 20d0189b1012 ("block: Introduce new bio_split()") cc: stable@vger.kernel.org (3.14+) Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-10md/raid0: Do not bypass blocking queue entered for raid0 biosGuilherme G. Piccoli
----------------------------------------------------------------- This patch is not on mainline and is meant to 4.19 stable *only*. After the patch description there's a reasoning about that. ----------------------------------------------------------------- Commit cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") introduced the flag BIO_QUEUE_ENTERED in order split bios bypass the blocking queue entering routine and use the live non-blocking version. It was a result of an extensive discussion in a linux-block thread[0], and the purpose of this change was to prevent a hung task waiting on a reference to drop. Happens that md raid0 split bios all the time, and more important, it changes their underlying device to the raid member. After the change introduced by this flag's usage, we experience various crashes if a raid0 member is removed during a large write. This happens because the bio reaches the live queue entering function when the queue of the raid0 member is dying. A simple reproducer of this behavior is presented below: a) Build kernel v4.19.56-stable with CONFIG_BLK_DEV_THROTTLING=y. b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem. c) Run the following oneliner (supposing the raid0 is mounted in /mnt): (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme1n1/device/device/remove (whereas nvme1n1 is the 2nd array member) This will trigger the following warning/oops: ------------[ cut here ]------------ BUG: unable to handle kernel NULL pointer dereference at 0000000000000155 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI RIP: 0010:blk_throtl_bio+0x45/0x970 [...] Call Trace: generic_make_request_checks+0x1bf/0x690 generic_make_request+0x64/0x3f0 raid0_make_request+0x184/0x620 [raid0] ? raid0_make_request+0x184/0x620 [raid0] md_handle_request+0x126/0x1a0 md_make_request+0x7b/0x180 generic_make_request+0x19e/0x3f0 submit_bio+0x73/0x140 [...] This patch changes raid0 driver to fallback to the "old" blocking queue entering procedure, by clearing the BIO_QUEUE_ENTERED from raid0 bios. This prevents the crashes and restores the regular behavior of raid0 arrays when a member is removed during a large write. [0] lore.kernel.org/linux-block/343bbbf6-64eb-879e-d19e-96aebb037d47@I-love.SAKURA.ne.jp ---------------------------- Why this is not on mainline? ---------------------------- The patch was originally submitted upstream in linux-raid and linux-block mailing-lists - it was initially accepted by Song Liu, but Christoph Hellwig[1] observed that there was a clean-up series ready to be accepted from Ming Lei[2] that fixed the same issue. The accepted patches from Ming's series in upstream are: commit 47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and commit fe2008640ae3 ("block: don't protect generic_make_request_checks with blk_queue_enter"). Those patches basically do a clean-up in the block layer involving: 1) Putting back blk_exit_queue() logic into __blk_release_queue(); that path was changed in the past and the logic from blk_exit_queue() was added to blk_cleanup_queue(). 2) Removing the guard/protection in generic_make_request_checks() with blk_queue_enter(). The problem with Ming's series for -stable is that it relies in the legacy request IO path removal. So it's "backport-able" to v5.0+, but doing that for early versions (like 4.19) would incur in complex code changes. Hence, it was suggested by Christoph and Song Liu that this patch was submitted to stable only; otherwise merging it upstream would add code to fix a path removed in a subsequent commit. [1] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org [2] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-12treewide: kzalloc() -> kcalloc()Kees Cook
The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-05-30md: convert to bioset_init()/mempool_init()Kent Overstreet
Convert md to embedded bio sets. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()Bart Van Assche
This patch has been generated as follows: for verb in set_unlocked clear_unlocked set clear; do replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \ $(git grep -lw queue_flag_${verb} drivers block/bsg*) done Except for protecting all queue flag changes with the queue lock this patch does not change any functionality. Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01md: remove special meaning of ->quiesce(.., 2)NeilBrown
The '2' argument means "wake up anything that is waiting". This is an inelegant part of the design and was added to help support management of suspend_lo/suspend_hi setting. Now that suspend_lo/hi is managed in mddev_suspend/resume, that need is gone. These is still a couple of places where we call 'quiesce' with an argument of '2', but they can safely be changed to call ->quiesce(.., 1); ->quiesce(.., 0) which achieve the same result at the small cost of pausing IO briefly. This removes a small "optimization" from suspend_{hi,lo}_store, but it isn't clear that optimization served a useful purpose. The code now is a lot clearer. Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-09-07Merge tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD updates from Shaohua Li: "This update mainly fixes bugs: - Make raid5 ppl support several ppl from Pawel - Several raid5-cache bug fixes from Song - Bitmap fixes from Neil and Me - One raid1/10 regression fix since 4.12 from Me - Other small fixes and cleanup" * tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/bitmap: disable bitmap_resize for file-backed bitmaps. raid5-ppl: Recovery support for multiple partial parity logs md: Runtime support for multiple ppls md/raid0: attach correct cgroup info in bio lib/raid6: align AVX512 constants to 512 bits, not bytes raid5: remove raid5_build_block md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show md: replace seq_release_private with seq_release md: notify about new spare disk in the container md/raid1/10: reset bio allocated from mempool md/raid5: release/flush io in raid5_do_work() md/bitmap: copy correct data for bitmap super
2017-08-28md: Runtime support for multiple pplsPawel Baldysiak
Increase PPL area to 1MB and use it as circular buffer to store PPL. The entry with highest generation number is the latest one. If PPL to be written is larger then space left in a buffer, rewind the buffer to the start (don't wrap it). Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md/raid0: attach correct cgroup info in bioShaohua Li
The discard bio doesn't attach the original bio cgroup info. Normal bio is cloned, so is fine. Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-13md: fix deadlock between mddev_suspend() and md_write_start()NeilBrown
If mddev_suspend() races with md_write_start() we can deadlock with mddev_suspend() waiting for the request that is currently in md_write_start() to complete the ->make_request() call, and md_write_start() waiting for the metadata to be updated to mark the array as 'dirty'. As metadata updates done by md_check_recovery() only happen then the mddev_lock() can be claimed, and as mddev_suspend() is often called with the lock held, these threads wait indefinitely for each other. We fix this by having md_write_start() abort if mddev_suspend() is happening, and ->make_request() aborts if md_write_start() aborted. md_make_request() can detect this abort, decrease the ->active_io count, and wait for mddev_suspend(). Reported-by: Nix <nix@esperi.org.uk> Fix: 68866e425be2(MD: no sync IO while suspended) Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-08md/md0: optimize raid0 discard handlingShaohua Li
There are complaints that raid0 discard handling is slow. Currently we divide discard request into chunks and dispatch to underlayer disks. The block layer will do merge to form big requests. This causes a lot of request split/merge and uses significant CPU time. A simple idea is to calculate the range for each raid disk for an IO request and send a discard request to raid disks, which will avoid the split/merge completely. Previously Coly tried the approach, but the implementation was too complex because of raid0 zones. This patch always split bio in zone boundary and handle bio within one zone. It simplifies the implementation a lot. Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-01Merge branch 'md-next' into md-linusShaohua Li
2017-04-11md/raid0: fix up bio splitting.NeilBrown
raid0_make_request() should use a private bio_set rather than the shared fs_bio_set, which is only meant for filesystems to use. raid0_make_request() shouldn't loop around using the bio_set multiple times as that can deadlock. So use mddev->bio_set and pass the tail to generic_make_request() instead of looping on it. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-08md: support REQ_OP_WRITE_ZEROESChristoph Hellwig
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-16md: superblock changes for PPLArtur Paszkiewicz
Include information about PPL location and size into mdp_superblock_1 and copy it to/from rdev. Because PPL is mutually exclusive with bitmap, put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for 'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to mddev->flags to indicate that PPL is enabled on an array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-24Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull md updates from Shaohua Li: "Mainly fixes bugs and improves performance: - Improve scalability for raid1 from Coly - Improve raid5-cache read performance, disk efficiency and IO pattern from Song and me - Fix a race condition of disk hotplug for linear from Coly - A few cleanup patches from Ming and Byungchul - Fix a memory leak from Neil - Fix WRITE SAME IO failure from me - Add doc for raid5-cache from me" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits) md/raid1: fix write behind issues introduced by bio_clone_bioset_partial md/raid1: handle flush request correctly md/linear: shutup lockdep warnning md/raid1: fix a use-after-free bug RAID1: avoid unnecessary spin locks in I/O barrier code RAID1: a new I/O barrier implementation to remove resync window md/raid5: Don't reinvent the wheel but use existing llist API md: fast clone bio in bio_clone_mddev() md: remove unnecessary check on mddev md/raid1: use bio_clone_bioset_partial() in case of write behind md: fail if mddev->bio_set can't be created block: introduce bio_clone_bioset_partial() md: disable WRITE SAME if it fails in underlayer disks md/raid5-cache: exclude reclaiming stripes in reclaim check md/raid5-cache: stripe reclaim only counts valid stripes MD: add doc for raid5-cache Documentation: move MD related doc into a separate dir md: ensure md devices are freed before module is unloaded. md/r5cache: improve journal device efficiency md/r5cache: enable chunk_aligned_read with write back cache ...
2017-02-13md: disable WRITE SAME if it fails in underlayer disksShaohua Li
This makes md do the same thing as dm for write same IO failure. Please see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need this. We did a little bit different than dm. Instead of disabling writesame in the first IO error, we disable it till next writesame IO coming after the first IO error. This way we don't need to clone a bio. Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581 Suggested-by: NeilBrown <neilb@suse.com> Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-02block: Use pointer to backing_dev_info from request_queueJan Kara
We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-05md: cleanup mddev flag clear for takeoverShaohua Li
Commit 6995f0b (md: takeover should clear unrelated bits) clear unrelated bits, but it's quite fragile. To avoid error in the future, define a macro for unsupported mddev flags for each raid type and use it to clear unsupported mddev flags. This should be less error-prone. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08md: takeover should clear unrelated bitsShaohua Li
When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit will be accidentally set, but raid5 doesn't support it. The same is true for the MD_HAS_JOURNAL bit. Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate) Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md: add block tracing for bio_remappingNeilBrown
The block tracing infrastructure (accessed with blktrace/blkparse) supports the tracing of mapping bios from one device to another. This is currently used when a bio in a partition is mapped to the whole device, when bios are mapped by dm, and for mapping in md/raid5. Other md personalities do not include this tracing yet, so add it. When a read-error is detected we redirect the request to a different device. This could justifiably be seen as a new mapping for the originial bio, or a secondary mapping for the bio that errors. This patch uses the second option. When md is used under dm-raid, the mappings are not traced as we do not have access to the block device number of the parent. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/raid0: replace printk() with pr_*()NeilBrown
This makes md/raid0 much less verbose as the messages about the array geometry are now pr_debug() Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-07block: rename bio bi_rw to bi_opfJens Axboe
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07md: use bio op accessorsMike Christie
Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-25md/raid0: remove empty line printk from dump_zonesMichał Pecio
Remove the final printk. All preceding output is already properly newline-terminated and the printk isn't even KERN_CONT to begin with, so it only adds one empty line to the log. Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-14md/raid0: fix uninitialized variable bugDan Carpenter
If this function fails the callers expect that *private_conf is set to an ERR_PTR() but that isn't true for the first error path where we can't allocate "conf". It leads to some uninitialized variable bugs. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Shaohua Li <shli@fb.com>
2015-12-08treewide: Fix typos in printkMasanari Iida
This patch fix multiple spelling typos found in various part of kernel. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-10-02md/raid0: apply base queue limits *before* disk_stack_limitsNeilBrown
Calling e.g. blk_queue_max_hw_sectors() after calls to disk_stack_limits() discards the settings determined by disk_stack_limits(). So we need to make those calls first. Fixes: 199dc6ed5179 ("md/raid0: update queue parameter in a safer location.") Cc: stable@vger.kernel.org (v2.6.35+ - please apply with 199dc6ed5179). Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-09-05Merge linux-block/for-4.3/core into md/for-linuxNeilBrown
There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-13block: kill merge_bvec_fn() completelyKent Overstreet
As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-03md/raid0: update queue parameter in a safer location.NeilBrown
When a (e.g.) RAID5 array is reshaped to RAID0, the updating of queue parameters (e.g. max number of sectors per bio) is done in the wrong place. It should be part of ->run, but it is actually part of ->takeover. This means it happens before level_store() calls: blk_set_stacking_limits(&mddev->queue->limits); and so it ineffective. This can lead to errors from underlying devices. So move all the relevant settings out of create_stripe_zones() and into raid0_run(). As this can lead to a bug-on it is suitable for any -stable kernel which supports reshape to RAID0. So 2.6.35 or later. As the bug has been present for five years there is no urgency, so no need to rush into -stable. Fixes: 9af204cf720c ("md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover") Cc: stable@vger.kernel.org (v2.6.35+ - please delay until after -final release). Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-21md/raid0: fix restore to sector variable in raid0_make_requestEric Work
The variable "sector" in "raid0_make_request()" was improperly updated by a call to "sector_div()" which modifies its first argument in place. Commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd restored this variable after the call for later re-use. Unfortunetly the restore was done after the referenced variable "bio" was advanced. This lead to the original value and the restored value being different. Here we move this line to the proper place. One observed side effect of this bug was discarding a file though unlinking would cause an unrelated file's contents to be discarded. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: 47d68979cc96 ("md/raid0: fix bug with chunksize not a power of 2.") Cc: stable@vger.kernel.org (any that received above backport) URL: https://bugzilla.kernel.org/show_bug.cgi?id=98501
2015-05-08md-raid0: conditional mddev->queue access to suit dm-raidHeinz Mauelshagen
This patch is a prerequisite for dm-raid "raid0" support to allow dm-raid to access the MD RAID0 personality doing unconditional accesses to mddev->queue, which is NULL in case of dm-raid stacked on top of MD. Most of the conditional mddev->queue accesses made it to upstream but this missing one, which prohibits md raid0 to set disk stack limits (being done in dm core in case of md underneath dm). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md raid0: access mddev->queue (request queue member) conditionally because ↵Heinz Mauelshagen
it is not set when accessed from dm-raid The patch makes 3 references to mddev->queue in the raid0 personality conditional in order to allow for it to be accessed from dm-raid. Mandatory, because md instances underneath dm-raid don't manage a request queue of their own which'd lead to oopses without the patch. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-10md/raid0: fix bug with chunksize not a power of 2.NeilBrown
Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1 in v3.14-rc1 RAID0 has performed incorrect calculations when the chunksize is not a power of 2. This happens because "sector_div()" modifies its first argument, but this wasn't taken into account in the patch. So restore that first arg before re-using the variable. Reported-by: Joe Landman <joe.landman@gmail.com> Reported-by: Dave Chinner <david@fromorbit.com> Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1 Cc: stable@vger.kernel.org (3.14 and later). Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-21md: fix problems with freeing private data after ->run failure.NeilBrown
If ->run() fails, it can either free the data structures it allocated, or leave that task to ->free() which will be called on failures. However: md.c calls ->free() even if ->private_data is NULL, which causes problems in some personalities. raid0.c frees the data, but doesn't clear ->private_data, which will become a problem when we fix md.c So better fix both these issues at once. Reported-by: Richard W.M. Jones <rjones@redhat.com> Fixes: 5aa61f427e4979be733e4847b9199ff9cc48a47e URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381 Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: rename ->stop to ->freeNeilBrown
Now that the ->stop function only frees the private data, rename is accordingly. Also pass in the private pointer as an arg rather than using mddev->private. This flexibility will be useful in level_store(). Finally, don't clear ->private. It doesn't make sense to clear it seeing that isn't what we free, and it is no longer necessary to clear ->private (it was some time ago before ->to_remove was introduced). Setting ->to_remove in ->free() is a bit of a wart, but not a big problem at the moment. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: split detach operation out from ->stop.NeilBrown
Each md personality has a 'stop' operation which does two things: 1/ it finalizes some aspects of the array to ensure nothing is accessing the ->private data 2/ it frees the ->private data. All the steps in '1' can apply to all arrays and so can be performed in common code. This is useful as in the case where we change the personality which manages an array (in level_store()), it would be helpful to do step 1 early, and step 2 later. So split the 'step 1' functionality out into a new mddev_detach(). Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: make merge_bvec_fn more robust in face of personality changes.NeilBrown
There is no locking around calls to merge_bvec_fn(), so it is possible that calls which coincide with a level (or personality) change could go wrong. So create a central dispatch point for these functions and use rcu_read_lock(). If the array is suspended, reject any merge that can be rejected. If not, we know it is safe to call the function. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: make ->congested robust against personality changes.NeilBrown
There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module etc. So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'. When the array personality is changing the array will be 'suspended' so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by included the whole call inside an rcu_read_lock() region. This require that the congested functions for all subordinate devices can be run under rcu_lock. Fortunately this is the case. Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14md: remove unwanted white space from md.cNeilBrown
My editor shows much of this is RED. Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-08md/raid0: check for bitmap compatability when changing raid levels.NeilBrown
If an array has a bitmap, then it cannot be converted to raid0. Reported-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>