aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md/md.c
AgeCommit message (Collapse)Author
2017-11-14Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD update from Shaohua Li: "This update mostly includes bug fixes: - md-cluster now supports raid10 from Guoqing - raid5 PPL fixes from Artur - badblock regression fix from Bo - suspend hang related fixes from Neil - raid5 reshape fixes from Neil - raid1 freeze deadlock fix from Nate - memleak fixes from Zdenek - bitmap related fixes from Me and Tao - other fixes and cleanups" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (33 commits) md: free unused memory after bitmap resize md: release allocated bitset sync_set md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb md: be cautious about using ->curr_resync_completed for ->recovery_offset badblocks: fix wrong return value in badblocks_set if badblocks are disabled md: don't check MD_SB_CHANGE_CLEAN in md_allow_write md-cluster: update document for raid10 md: remove redundant variable q raid1: remove obsolete code in raid1_write_request md-cluster: Use a small window for raid10 resync md-cluster: Suspend writes in RAID10 if within range md-cluster/raid10: set "do_balance = 0" if area is resyncing md: use lockdep_assert_held raid1: prevent freeze_array/wait_all_barriers deadlock md: use TASK_IDLE instead of blocking signals md: remove special meaning of ->quiesce(.., 2) md: allow metadata update while suspending. md: use mddev_suspend/resume instead of ->quiesce() md: move suspend_hi/lo handling into core md code md: don't call bitmap_create() while array is quiesced. ...
2017-11-10md: release allocated bitset sync_setZdenek Kabelac
Patch fixes kmemleak on md_stop() path used likely only by dm-raid wrapper. Code of md is using mddev_put() where both bitsets are released however this freeing is not shared. Also set NULL to bio_set and sync_set pointers just like mddev_put is doing. Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-09md: be cautious about using ->curr_resync_completed for ->recovery_offsetNeilBrown
The ->recovery_offset shows how much of a non-InSync device is actually in sync - how much has been recoveryed. When performing a recovery, ->curr_resync and ->curr_resync_completed follow the device address being recovered and so can be used to update ->recovery_offset. When performing a reshape, ->curr_resync* might follow the device addresses (raid5) or might follow array addresses (raid10), so cannot in general be used to set ->recovery_offset. When reshaping backwards, ->curre_resync* measures from the *end* of the array-or-device, so is particularly unhelpful. So change the common code in md.c to only use ->curr_resync_complete for the simple recovery case, and add code to raid5.c to update ->recovery_offset during a forwards reshape. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: don't check MD_SB_CHANGE_CLEAN in md_allow_writeArtur Paszkiewicz
Only MD_SB_CHANGE_PENDING should be used to wait for transition from clean to dirty. Checking also MD_SB_CHANGE_CLEAN is unnecessary and can race with e.g. md_do_sync(). This sporadically causes a hang when changing consistency policy during resync: INFO: task mdadm:6183 blocked for more than 30 seconds. Not tainted 4.14.0-rc3+ #391 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mdadm D12752 6183 6022 0x00000000 Call Trace: __schedule+0x93f/0x990 schedule+0x6b/0x90 md_allow_write+0x100/0x130 [md_mod] ? do_wait_intr_irq+0x90/0x90 resize_stripes+0x3a/0x5b0 [raid456] ? kernfs_fop_write+0xbe/0x180 raid5_change_consistency_policy+0xa6/0x200 [raid456] consistency_policy_store+0x2e/0x70 [md_mod] md_attr_store+0x90/0xc0 [md_mod] sysfs_kf_write+0x42/0x50 kernfs_fop_write+0x119/0x180 __vfs_write+0x28/0x110 ? rcu_sync_lockdep_assert+0x12/0x60 ? __sb_start_write+0x15a/0x1c0 ? vfs_write+0xa3/0x1a0 vfs_write+0xb4/0x1a0 SyS_write+0x49/0xa0 entry_SYSCALL_64_fastpath+0x18/0xad Fixes: 2214c260c72b ("md: don't return -EAGAIN in md_allow_write for external metadata arrays") Cc: <stable@vger.kernel.org> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: use lockdep_assert_heldShaohua Li
lockdep_assert_held is a better way to assert lock held, and it works for UP. Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: remove special meaning of ->quiesce(.., 2)NeilBrown
The '2' argument means "wake up anything that is waiting". This is an inelegant part of the design and was added to help support management of suspend_lo/suspend_hi setting. Now that suspend_lo/hi is managed in mddev_suspend/resume, that need is gone. These is still a couple of places where we call 'quiesce' with an argument of '2', but they can safely be changed to call ->quiesce(.., 1); ->quiesce(.., 0) which achieve the same result at the small cost of pausing IO briefly. This removes a small "optimization" from suspend_{hi,lo}_store, but it isn't clear that optimization served a useful purpose. The code now is a lot clearer. Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: allow metadata update while suspending.NeilBrown
There are various deadlocks that can occur when a thread holds reconfig_mutex and calls ->quiesce(mddev, 1). As some write request block waiting for metadata to be updated (e.g. to record device failure), and as the md thread updates the metadata while the reconfig mutex is held, holding the mutex can stop write requests completing, and this prevents ->quiesce(mddev, 1) from completing. ->quiesce() is now usually called from mddev_suspend(), and it is always called with reconfig_mutex held. So at this time it is safe for the thread to update metadata without explicitly taking the lock. So add 2 new flags, one which says the unlocked updates is allowed, and one which ways it is happening. Then allow it while the quiesce completes, and then wait for it to finish. Reported-and-tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: use mddev_suspend/resume instead of ->quiesce()NeilBrown
mddev_suspend() is a more general interface than calling ->quiesce() and is so more extensible. A future patch will make use of this. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: move suspend_hi/lo handling into core md codeNeilBrown
responding to ->suspend_lo and ->suspend_hi is similar to responding to ->suspended. It is best to wait in the common core code without incrementing ->active_io. This allows mddev_suspend()/mddev_resume() to work while requests are waiting for suspend_lo/hi to change. This is will be important after a subsequent patch which uses mddev_suspend() to synchronize updating for suspend_lo/hi. So move the code for testing suspend_lo/hi out of raid1.c and raid5.c, and place it in md.c Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: don't call bitmap_create() while array is quiesced.NeilBrown
bitmap_create() allocates memory with GFP_KERNEL and so can wait for IO. If called while the array is quiesced, it could wait indefinitely for write out to the array - deadlock. So call bitmap_create() before quiescing the array. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: always hold reconfig_mutex when calling mddev_suspend()NeilBrown
Most often mddev_suspend() is called with reconfig_mutex held. Make this a requirement in preparation a subsequent patch. Also require reconfig_mutex to be held for mddev_resume(), partly for symmetry and partly to guarantee no races with incr/decr of mddev->suspend. Taking the mutex in r5c_disable_writeback_async() is a little tricky as this is called from a work queue via log->disable_writeback_work, and flush_work() is called on that while holding ->reconfig_mutex. If the work item hasn't run before flush_work() is called, the work function will not be able to get the mutex. So we use mddev_trylock() inside the wait_event() call, and have that abort when conf->log is set to NULL, which happens before flush_work() is called. We wait in mddev->sb_wait and ensure this is woken when any of the conditions change. This requires waking mddev->sb_wait in mddev_unlock(). This is only like to trigger extra wake_ups of threads that needn't be woken when metadata is being written, and that doesn't happen often enough that the cost would be noticeable. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01md: forbid a RAID5 from having both a bitmap and a journal.NeilBrown
Having both a bitmap and a journal is pointless. Attempting to do so can corrupt the bitmap if the journal replay happens before the bitmap is initialized. Rather than try to avoid this corruption, simply refuse to allow arrays with both a bitmap and a journal. So: - if raid5_run sees both are present, fail. - if adding a bitmap finds a journal is present, fail - if adding a journal finds a bitmap is present, fail. Cc: stable@vger.kernel.org (4.10+) Signed-off-by: NeilBrown <neilb@suse.com> Tested-by: Joshua Kinard <kumba@gentoo.org> Acked-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: Shaohua Li <shli@fb.com>
2017-10-31treewide: Fix function prototypes for module_param_call()Kees Cook
Several function prototypes for the set/get functions defined by module_param_call() have a slightly wrong argument types. This fixes those in an effort to clean up the calls when running under type-enforced compiler instrumentation for CFI. This is the result of running the following semantic patch: @match_module_param_call_function@ declarer name module_param_call; identifier _name, _set_func, _get_func; expression _arg, _mode; @@ module_param_call(_name, _set_func, _get_func, _arg, _mode); @fix_set_prototype depends on match_module_param_call_function@ identifier match_module_param_call_function._set_func; identifier _val, _param; type _val_type, _param_type; @@ int _set_func( -_val_type _val +const char * _val , -_param_type _param +const struct kernel_param * _param ) { ... } @fix_get_prototype depends on match_module_param_call_function@ identifier match_module_param_call_function._get_func; identifier _val, _param; type _val_type, _param_type; @@ int _get_func( -_val_type _val +char * _val , -_param_type _param +const struct kernel_param * _param ) { ... } Two additional by-hand changes are included for places where the above Coccinelle script didn't notice them: drivers/platform/x86/thinkpad_acpi.c fs/lockd/svc.c Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-10-25locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland
to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-16md: rename some drivers/md/ files to have an "md-" prefixMike Snitzer
Motivated by the desire to illiminate the imprecise nature of DM-specific patches being unnecessarily sent to both the MD maintainer and mailing-list. Which is born out of the fact that DM files also reside in drivers/md/ Now all MD-specific files in drivers/md/ start with either "raid" or "md-" and the MAINTAINERS file has been updated accordingly. Shaohua: don't change module name Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-10-08md: always set THREAD_WAKEUP and wake up wqueue if thread existedGuoqing Jiang
Since commit 4ad23a976413 ("MD: use per-cpu counter for writes_pending"), the wait_queue is only got invoked if THREAD_WAKEUP is not set previously. With above change, I can see process_metadata_update could always hang on the wait queue, because mddev->thread could stay on 'D' status and the THREAD_WAKEUP flag is not cleared since there are lots of place to wake up mddev->thread. Then deadlock happened as follows: linux175:~ # ps aux|grep md|grep D root 20117 0.0 0.0 0 0 ? D 03:45 0:00 [md0_raid1] root 20125 0.0 0.0 0 0 ? D 03:45 0:00 [md0_cluster_rec] linux175:~ # cat /proc/20117/stack [<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster] [<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster] [<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster] [<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod] [<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod] [<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod] [<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1] [<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod] [<ffffffff81091741>] kthread+0x101/0x140 linux175:~ # cat /proc/20125/stack [<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster] [<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod] [<ffffffff81091741>] kthread+0x101/0x140 So let's revert the part of code in the commit to resovle the problem since we can't get lots of benefits of previous change. Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-10-05md: fix deadlock error in recent patch.NeilBrown
A recent patch aimed to cause md_write_start() to fail (rather than block) when the mddev was suspending, so as to avoid deadlocks. Unfortunately the test in wait_event() was wrong, and it didn't change behaviour at all. We wait_event() must wait until the metadata is written OR the array is suspending. Fixes: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()") Cc: stable@vger.kernel.org Reported-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-09-27md: fix a race condition for flush request handlingShaohua Li
md_submit_flush_data calls pers->make_request, which missed the suspend check. Fix it with the new md_handle_request API. Reported-by: Nate Dailey <nate.dailey@stratus.com> Tested-by: Nate Dailey <nate.dailey@stratus.com> Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start()) Cc: stable@vger.kernel.org Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-09-27md: separate request handlingShaohua Li
With commit cc27b0c78c79, pers->make_request could bail out without handling the bio. If that happens, we should retry. The commit fixes md_make_request but not other call sites. Separate the request handling part, so other call sites can use it. Reported-by: Nate Dailey <nate.dailey@stratus.com> Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start()) Cc: stable@vger.kernel.org Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-09-07Merge tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD updates from Shaohua Li: "This update mainly fixes bugs: - Make raid5 ppl support several ppl from Pawel - Several raid5-cache bug fixes from Song - Bitmap fixes from Neil and Me - One raid1/10 regression fix since 4.12 from Me - Other small fixes and cleanup" * tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/bitmap: disable bitmap_resize for file-backed bitmaps. raid5-ppl: Recovery support for multiple partial parity logs md: Runtime support for multiple ppls md/raid0: attach correct cgroup info in bio lib/raid6: align AVX512 constants to 512 bits, not bytes raid5: remove raid5_build_block md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show md: replace seq_release_private with seq_release md: notify about new spare disk in the container md/raid1/10: reset bio allocated from mempool md/raid5: release/flush io in raid5_do_work() md/bitmap: copy correct data for bitmap super
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-08-28md: Runtime support for multiple pplsPawel Baldysiak
Increase PPL area to 1MB and use it as circular buffer to store PPL. The entry with highest generation number is the latest one. If PPL to be written is larger then space left in a buffer, rewind the buffer to the start (don't wrap it). Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md: replace seq_release_private with seq_releaseCihangir Akturk
Since commit f15146380d28 ("fs: seq_file - add event counter to simplify poll() support"), md.c code has been no longer used the private field of the struct seq_file, but seq_release_private() has been continued to be used to release the allocated seq_file. This seems to have been forgotten. So convert it to use seq_release() instead of seq_release_private(). Signed-off-by: Cihangir Akturk <cakturk@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md: notify about new spare disk in the containerAlexey Obitotskiy
In case of external metadata arrays spare disks are added to containers first. mdadm keeps monitoring /proc/mdstat output and when spare disk is available, it moves it from the container to the array. The problem is there is no notification of new spare disk in the container and mdadm waits a long time (until timeout) before it takes the action. Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-11MD: not clear ->safemode for external metadata arrayShaohua Li
->safemode should be triggered by mdadm for external metadaa array, otherwise array's state confuses mdadm. Fixes: 33182d15c6bf(md: always clear ->safemode when md_check_recovery gets the mddev lock.) Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08md: fix test in md_write_start()NeilBrown
md_write_start() needs to clear the in_sync flag is it is set, or if there might be a race with set_in_sync() such that the later will set it very soon. In the later case it is sufficient to take the spinlock to synchronize with set_in_sync(), and then set the flag if needed. The current test is incorrect. It should be: if "flag is set" or "race is possible" "flag is set" is trivially "mddev->in_sync". "race is possible" should be tested by "mddev->sync_checkers". If sync_checkers is 0, then there can be no race. set_in_sync() will wait in percpu_ref_switch_to_atomic_sync() for an RCU grace period, and as md_write_start() holds the rcu_read_lock(), set_in_sync() will be sure ot see the update to writes_pending. If sync_checkers is > 0, there could be race. If md_write_start() happened entirely between if (!mddev->in_sync && percpu_ref_is_zero(&mddev->writes_pending)) { and mddev->in_sync = 1; in set_in_sync(), then it would not see that is_sync had been set, and set_in_sync() would not see that writes_pending had been incremented. This bug means that in_sync is sometimes not set when it should be. Consequently there is a small chance that the array will be marked as "clean" when in fact it is inconsistent. Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") cc: stable@vger.kernel.org (v4.12+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08md: always clear ->safemode when md_check_recovery gets the mddev lock.NeilBrown
If ->safemode == 1, md_check_recovery() will try to get the mddev lock and perform various other checks. If mddev->in_sync is zero, it will call set_in_sync, and clear ->safemode. However if mddev->in_sync is not zero, ->safemode will not be cleared. When md_check_recovery() drops the mddev lock, the thread is woken up again. Normally it would just check if there was anything else to do, find nothing, and go to sleep. However as ->safemode was not cleared, it will take the mddev lock again, then wake itself up when unlocking. This results in an infinite loop, repeatedly calling md_check_recovery(), which RCU or the soft-lockup detector will eventually complain about. Prior to commit 4ad23a976413 ("MD: use per-cpu counter for writes_pending"), safemode would only be set to one when the writes_pending counter reached zero, and would be cleared again when writes_pending is incremented. Since that patch, safemode is set more freely, but is not reliably cleared. So in md_check_recovery() clear ->safemode before checking ->in_sync. Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") Cc: stable@vger.kernel.org (4.12+) Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Reported-by: David R <david@unsolicited.net> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-25MD: fix warnning for UP caseShaohua Li
spin_is_locked always returns 0 for UP case, so ignores it Reported-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-08Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD update from Shaohua Li: - fixed deadlock in MD suspend and a potential bug in bio allocation (Neil Brown) - fixed signal issue (Mikulas Patocka) - fixed typo in FailFast test (Guoqing Jiang) - other trival fixes * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: MD: fix sleep in atomic MD: fix a null dereference md: use a separate bio_set for synchronous IO. md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE md/raid1: remove unused bio in sync_request_write md/raid10: fix FailFast test for wrong device md: don't use flush_signals in userspace processes md: fix deadlock between mddev_suspend() and md_write_start()
2017-07-03MD: fix sleep in atomicShaohua Li
bioset_free() will take a mutex, so can't get called with spinlock hold. Fix: 5a85071c2cbc(md: use a separate bio_set for synchronous IO.) Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-23MD: fix a null dereferenceShaohua Li
rdev->mddev could be null in start time. Reported-by: Ming Lei <ming.lei@redhat.com> Fix: 5a85071c2cbc(md: use a separate bio_set for synchronous IO.) Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-21md: use a separate bio_set for synchronous IO.NeilBrown
md devices allocate a bio_set and use it for two distinct purposes. mddev->bio_set is used to clone bios as part of sending upper level requests down to lower level devices, and it is also use for synchronous IO such as superblock and bitmap updates, and for correcting read errors. This multiple usage can lead to deadlocks. It is likely that cloned bios might be queued for write and to be waiting for a metadata update before the write can be permitted. If the cloning exhausted mddev->bio_set, the metadata update may not be able to proceed. This scenario has been seen during heavy testing, with lots of IO and lots of memory pressure. Address this by adding a new bio_set specifically for synchronous IO. All synchronous IO goes directly to the underlying device and is not queued at the md level, so request using entries from the new mddev->sync_set will complete in a timely fashion. Requests that use mddev->bio_set will sometimes need to wait for synchronous IO, but will no longer risk deadlocking that iO. Also: small simplification in mddev_put(): there is no need to wait until the spinlock is released before calling bioset_free(). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-18block: remove bio_clone() and all references.NeilBrown
bio_clone() is no longer used. Only bio_clone_bioset() or bio_clone_fast(). This is for the best, as bio_clone() used fs_bio_set, and filesystems are unlikely to want to use bio_clone(). So remove bio_clone() and all references. This includes a fix to some incorrect documentation. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18blk: replace bioset_create_nobvec() with a flags arg to bioset_create()NeilBrown
"flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18blk: remove bio_set arg from blk_queue_split()NeilBrown
blk_queue_split() is always called with the last arg being q->bio_split, where 'q' is the first arg. Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses q->bio_split. This is inconsistent and unnecessary. Remove the last arg and always use q->bio_split inside blk_queue_split() Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed) Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-16md: change the initialization value for a spare device spot to ↵Lidong Zhong
MD_DISK_ROLE_SPARE The value for spare spot of sb->dev_roles is changed from MD_DISK_ROLE_FAULTY to MD_DISK_ROLE_SPARE to keep align with the value when the superblock is firstly created in userspace. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-13md: fix deadlock between mddev_suspend() and md_write_start()NeilBrown
If mddev_suspend() races with md_write_start() we can deadlock with mddev_suspend() waiting for the request that is currently in md_write_start() to complete the ->make_request() call, and md_write_start() waiting for the metadata to be updated to mark the array as 'dirty'. As metadata updates done by md_check_recovery() only happen then the mddev_lock() can be claimed, and as mddev_suspend() is often called with the lock held, these threads wait indefinitely for each other. We fix this by having md_write_start() abort if mddev_suspend() is happening, and ->make_request() aborts if md_write_start() aborted. md_make_request() can detect this abort, decrease the ->active_io count, and wait for mddev_suspend(). Reported-by: Nix <nix@esperi.org.uk> Fix: 68866e425be2(MD: no sync IO while suspended) Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-13Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into ↵Christoph Hellwig
nvme-base
2017-06-12Merge tag 'v4.12-rc5' into for-4.13/blockJens Axboe
We've already got a few conflicts and upcoming work depends on some of the changes that have gone into mainline as regression fixes for this series. Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream trees to continue working on 4.13 changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-09block: switch bios to blk_status_tChristoph Hellwig
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-05md: initialise ->writes_pending in personality modules.NeilBrown
The new per-cpu counter for writes_pending is initialised in md_alloc(), which is not called by dm-raid. So dm-raid fails when md_write_start() is called. Move the initialization to the personality modules that need it. This way it is always initialised when needed, but isn't unnecessarily initialized (requiring memory allocation) when the personality doesn't use writes_pending. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-05md: namespace private helper namesAmir Goldstein
The md private helper uuid_equal() collides with a generic helper of the same name. Rename the md private helper to md_uuid_equal() and do the same for md_sb_equal(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-05-31md: Make flush bios explicitely syncJan Kara
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. CC: linux-raid@vger.kernel.org CC: Shaohua Li <shli@kernel.org> Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-08md: don't return -EAGAIN in md_allow_write for external metadata arraysArtur Paszkiewicz
This essentially reverts commit b5470dc5fc18 ("md: resolve external metadata handling deadlock in md_allow_write") with some adjustments. Since commit 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") changing array_state to 'active' does not use mddev_lock() and will not cause a deadlock with md_allow_write(). This revert simplifies userspace tools that write to sysfs attributes like "stripe_cache_size" or "consistency_policy" because it removes the need for special handling for external metadata arrays, checking for EAGAIN and retrying the write. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-20md: handle read-only member devices better.NeilBrown
1/ If an array has any read-only devices when it is started, the array itself must be read-only 2/ A read-only device cannot be added to an array after it is started. 3/ Setting an array to read-write should not succeed if any member devices are read-only Reported-and-Tested-by: Nanda Kishore Chinnaram <Nanda_Kishore_Chinna@dell.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-12md: support disabling of create-on-open semantics.NeilBrown
md allows a new array device to be created by simply opening a device file. This make it difficult to remove the device and udev is likely to open the device file as part of processing the REMOVE event. There is an alternate mechanism for creating arrays by writing to the new_array module parameter. When using tools that work with this parameter, it is best to disable the old semantics. This new module parameter allows that. Signed-off-by: NeilBrown <neilb@suse.com> Acted-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-12md: allow creation of mdNNN arrays via md_mod/parameters/new_arrayNeilBrown
The intention when creating the "new_array" parameter and the possibility of having array names line "md_HOME" was to transition away from the old way of creating arrays and to eventually only use this new way. The "old" way of creating array is to create a device node in /dev and then open it. The act of opening creates the array. This is problematic because sometimes the device node can be opened when we don't want to create an array. This can easily happen when some rule triggered by udev looks at a device as it is being destroyed. The node in /dev continues to exist for a short period after an array is stopped, and opening it during this time recreates the array (as an inactive array). Unfortunately no clear plan for the transition was created. It is now time to fix that. This patch allows devices with numeric names, like "md999" to be created by writing to "new_array". This will only work if the minor number given is not already in use. This will allow mdadm to support the creation of arrays with numbers > 511 (currently not possible) by writing to new_array. mdadm can, at some point, use this approach to create *all* arrays, which will allow the transition to only using the new-way. Signed-off-by: NeilBrown <neilb@suse.com> Acted-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10md.c:didn't unlock the mddev before return EINVAL in array_size_storeZhilong Liu
md.c: it needs to release the mddev lock before the array_size_store() returns. Fixes: ab5a98b132fd ("md-cluster: change array_sectors and update size are not supported") Signed-off-by: Zhilong Liu <zlliu@suse.com> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stopNeilBrown
if called md_set_readonly and set MD_CLOSING bit, the mddev cannot be opened any more due to the MD_CLOING bit wasn't cleared. Thus it needs to be cleared in md_ioctl after any call to md_set_readonly() or do_md_stop(). Signed-off-by: NeilBrown <neilb@suse.com> Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") Cc: stable@vger.kernel.org (v4.9+) Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>