path: root/drivers/md/raid5.c
Age | Commit message | Author
2016-09-09raid5: fix a small race conditionShaohua Li
commit 5f9d1fde7d54a5 (raid5: fix memory leak of bio integrity data) moves bio_reset to bio_endio. But it introduces a small race condition: it does bio_reset after raid5_release_stripe, which could make the stripe reusable and hence reuse the bio just before bio_reset runs. Moving bio_reset before raid5_release_stripe is called should fix the race. Reported-and-tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Signed-off-by: Shaohua Li <shli@fb.com>
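A minimal sketch of the ordering this fix describes, assuming the raid5 write-completion path; the helper name below is illustrative, not the actual raid5_end_write_request body:

    /*
     * Reset the bio while we still hold our reference to the stripe, then
     * release the stripe. Releasing first would let another thread reuse
     * the stripe - and with it this embedded bio - before bio_reset() has
     * run, which is the race described above.
     */
    static void raid5_write_end_sketch(struct bio *bi, struct stripe_head *sh)
    {
            bio_reset(bi);                  /* frees integrity data, clears flags */
            raid5_release_stripe(sh);       /* stripe (and bio) may be reused now */
    }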
2016-08-31raid5: guarantee enough stripes to avoid reshape hangShaohua Li
If there aren't enough stripes, reshape will hang. We have a check for this when a new reshape starts, but miss it when a reshape resumes, hence we could see a hang on reshape resume. This patch forces enough stripes to exist when a reshape resumes. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-30Merge tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD fixes from Shaohua Li: "This includes several bug fixes:
- Alexey Obitotskiy fixed a hang for a faulty raid5 array with external management
- Song Liu fixed two raid5 journal related bugs
- Tomasz Majchrzak fixed a bad block recording issue and an accounting issue for raid10
- ZhengYuan Liu fixed an accounting issue for raid5
- I fixed a potential race condition and memory leak with DIF/DIX enabled
- other trivial fixes"

* tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  raid5: avoid unnecessary bio data set
  raid5: fix memory leak of bio integrity data
  raid10: record correct address of bad block
  md-cluster: fix error return code in join()
  r5cache: set MD_JOURNAL_CLEAN correctly
  md: don't print the same repeated messages about delayed sync operation
  md: remove obsolete ret in md_start_sync
  md: do not count journal as spare in GET_ARRAY_INFO
  md: Prevent IO hold during accessing to faulty raid5 array
  MD: hold mddev lock to change bitmap location
  raid5: fix incorrectly counter of conf->empty_inactive_list_nr
  raid10: increment write counter after bio is split
2016-08-24raid5: avoid unnecessary bio data setShaohua Li
bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to set them every time. bi_private will be set before the bio is dispatched. Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-24raid5: fix memory leak of bio integrity dataShaohua Li
Yi reported a memory leak in raid5 with DIF/DIX enabled disks. raid5 doesn't alloc/free bios, instead it reuses them. There are two issues in the current code:
1. the code calls bio_init (from init_stripe->raid5_build_block->bio_init) and then bio_reset (ops_run_io). The bio is reused, so there is likely integrity data attached. bio_init clears the pointer to the integrity data, so bio_reset can't release the data.
2. bio_reset is called before the bio is dispatched. After the bio is finished, it's possible we never free the bio's integrity data (eg, we don't call bio_reset again).
Both issues cause a memory leak. The patch moves bio_init to stripe creation and bio_reset to bio end io, which fixes both issues. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
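A rough sketch of the lifecycle change described above, assuming the bios embedded in struct r5dev; the exact call sites are simplified:

    /* at stripe creation time: one-time init, no integrity data exists yet */
    bio_init(&dev->req);
    bio_init(&dev->rreq);

    /* at bio end_io, after the IO has finished: releases any integrity data
     * that was attached for this particular IO */
    bio_reset(&dev->req);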
2016-08-24r5cache: set MD_JOURNAL_CLEAN correctlySong Liu
Currently, the code sets MD_JOURNAL_CLEAN when the array has MD_FEATURE_JOURNAL and recovery_cp is MaxSector. The array will be MD_JOURNAL_CLEAN even if the journal device is missing. With this patch, MD_JOURNAL_CLEAN is only set when the journal device is present. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
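A hedged sketch of the intended condition; flag access paths are simplified, and journal_dev is an illustrative local meaning "a journal rdev was found":

    /* only claim a clean journal when a journal device is actually present */
    if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
        journal_dev && mddev->recovery_cp == MaxSector)
            set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
    else
            clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);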
2016-08-07block: rename bio bi_rw to bi_opfJens Axboe
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokenness linger, rename the member to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-05md: Prevent IO hold during accessing to faulty raid5 arrayAlexey Obitotskiy
After the array enters a faulty state (e.g. the number of failed drives exceeds what the raid5 level accepts) it sets error flags (one of these flags is MD_CHANGE_PENDING). For internal metadata arrays MD_CHANGE_PENDING is cleared in md_update_sb, but not for external metadata arrays. A set MD_CHANGE_PENDING flag prevents new or unfinished IOs to the array from completing and holds them in a pending state. In some cases this can lead to a deadlock. For example, we have a faulty array (2 of 4 drives failed), udev handles the array state change and blkid is started (or another userspace application that reads/writes the array) but is unable to finish its reads due to the IO hold. At the same time we are unable to get exclusive access to the array (to stop it in our case) because another external application is still using it. The fix makes it possible to return IO with errors immediately, so the external application can finish working with the array and give exclusive access to other applications to perform the required management actions. Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-01raid5: fix incorrectly counter of conf->empty_inactive_list_nrZhengYuan Liu
The counter conf->empty_inactive_list_nr is only used to determine whether the raid5 array is congested, which is handled in raid5_congested(). It is increased in get_free_stripe() when conf->inactive_list becomes empty and decreased in release_inactive_stripe_list() when temp_inactive_list is spliced back onto conf->inactive_list. However, there is a problem when raid5_get_active_stripe or stripe_add_to_batch_list is called, because these two functions may call list_del_init(&sh->lru) to delete sh from "conf->inactive_list + hash", which may leave "conf->inactive_list + hash" empty when atomic_inc_not_zero(&sh->count) returns false. So a check should be done at these two points and empty_inactive_list_nr increased accordingly. Otherwise the counter may go negative, which would influence async readahead from VFS. Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
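A simplified sketch of the accounting rule being added at those two call sites; locking and surrounding logic are elided:

    /* sh->count was zero, so sh still sits on the inactive list: take it off
     * and, if that emptied the per-hash inactive list, record the fact so
     * raid5_congested() keeps seeing a consistent count */
    if (!list_empty(&sh->lru)) {
            list_del_init(&sh->lru);
            if (list_empty(conf->inactive_list + hash))
                    atomic_inc(&conf->empty_inactive_list_nr);
    }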
2016-07-28Merge branch 'mymd/for-next' into mymd/for-linusShaohua Li
2016-07-20block: get rid of bio_rw and READAChristoph Hellwig
These two are confusing leftovers of the old world order, combining values of the REQ_OP_ and REQ_ namespaces. For callers that don't special-case, we mostly just replace bi_rw with bio_data_dir or op_is_write, except for the few cases where a switch over the REQ_OP_ values makes more sense. Any check for READA is replaced with an explicit check for REQ_RAHEAD. Also remove the READA alias for REQ_RAHEAD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
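A few hedged before/after examples of that conversion. The wrapper functions are illustrative, and the field is written here as bi_opf, the name it gets in the later rename above; at the time of this commit it was still bi_rw:

    /* old world: if (bio_rw(bio) == READA) ... */
    static bool is_readahead(struct bio *bio)
    {
            return bio->bi_opf & REQ_RAHEAD;        /* explicit flag check */
    }

    /* old world: if (bio_rw(bio) == WRITE) ... */
    static bool is_write_io(struct bio *bio)
    {
            return op_is_write(bio_op(bio));        /* or bio_data_dir(bio) == WRITE */
    }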
2016-06-13md: reduce the number of synchronize_rcu() calls when multiple devices fail.NeilBrown
Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made, which can delay several milliseconds in some cases. If lots of devices fail at once - as could happen with a large RAID10 where one set of devices is removed all at once - these delays can add up to be very inconvenient. As failure is not reversible we can check for it first, setting a separate flag if it is found, and then call synchronize_rcu() once for all the flagged devices. The ->hot_remove_disk() function can then skip the synchronize_rcu() step if the flag is set. fix build error (Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
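A hedged sketch of the batching idea; the loop location and locals are illustrative, and RemoveSynchronized is, to my recollection, the rdev flag this patch adds:

    struct md_rdev *rdev;
    bool need_sync = false;

    /* pass 1: mark every failed device that is about to be removed */
    rdev_for_each(rdev, mddev)
            if (test_bit(Faulty, &rdev->flags) && rdev->raid_disk >= 0) {
                    set_bit(RemoveSynchronized, &rdev->flags);
                    need_sync = true;
            }

    /* pass 2: one grace period covers all flagged devices; each
     * ->hot_remove_disk() then skips its own synchronize_rcu() */
    if (need_sync)
            synchronize_rcu();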
2016-06-13md: be extra careful not to take a reference to a Faulty device.NeilBrown
It is important that we never increment rdev->nr_pending on a Faulty device, as ->hot_remove_disk() assumes that once the Faulty flag is visible no code will take a new reference. Some places take a new reference after checking only In_sync. This should be safe as the two are changed together. However, to make the code more obviously safe, add checks for 'Faulty' as well. Note: the actual rule is: never increment nr_pending if Faulty is set and Blocked is clear, never clear Faulty, and never set Blocked without holding a reference through nr_pending. fix build error (Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
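The rule above as a hedged code pattern, simplified from the kind of rcu-protected rdev lookups raid5.c does:

    struct md_rdev *rdev;

    rcu_read_lock();
    rdev = rcu_dereference(conf->disks[i].rdev);
    if (rdev && !test_bit(Faulty, &rdev->flags) &&
        test_bit(In_sync, &rdev->flags))
            atomic_inc(&rdev->nr_pending);  /* safe: Faulty checked under RCU */
    else
            rdev = NULL;                    /* caller must not use this rdev */
    rcu_read_unlock();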
2016-06-13md/raid5: add rcu protection to rdev accesses in raid5_status.NeilBrown
Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13md/raid5: add rcu protection to rdev accesses in want_replaceNeilBrown
Being in the middle of resync is no longer protection against failed rdevs disappearing. So add rcu protection. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13md/raid5: add rcu protection to rdev accesses in handle_failed_sync.NeilBrown
The rdev could be freed while handle_failed_sync is running, so rcu protection is needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, fs: shrink bi_rw from long to intMike Christie
We don't need bi_rw to be so large on 64 bit archs, so reduce it to unsigned int. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07md: use bio op accessorsMike Christie
Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
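A hedged usage example of those accessors; the flag choice and switch body are illustrative:

    /* setting: the operation and the flag bits are now kept separate */
    bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_FUA);

    /* getting: branch on the op, not on raw bi_rw bits */
    switch (bio_op(bio)) {
    case REQ_OP_READ:
    case REQ_OP_WRITE:
            /* normal data IO */
            break;
    case REQ_OP_DISCARD:
            /* trim handling */
            break;
    default:
            break;
    }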
2016-05-25right meaning of PARITY_ENABLE_RMW and PARITY_PREFER_RMWSong Liu
In the current handle_stripe_dirtying, the code prefers rmw when PARITY_ENABLE_RMW is set and rcw when PARITY_PREFER_RMW is set. This patch reverses that behavior so the flags carry their intended meaning. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09md: set MD_CHANGE_PENDING in an atomic regionGuoqing Jiang
Some code waits for a metadata update by:
1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared
If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomic, and bit_clear_unless (suggested by Neil) is introduced for that purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
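A hedged sketch of the two halves of that pattern, simplified from md.c: set_mask_bits() makes steps 1 and 2 one atomic update, and bit_clear_unless() is the new helper used on the md_update_sb() side:

    /* setter side: flag the change and PENDING in a single atomic operation */
    set_mask_bits(&mddev->flags, 0,
                  BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));

    /* md_update_sb() side: only clear PENDING if no new change slipped in,
     * otherwise go around (the retry label in md_update_sb) and write the
     * superblock again */
    if (!bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING),
                          BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_CLEAN)))
            goto repeat;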
2016-05-09md: raid5: add prerequisite to run underneath dm-raidHeinz Mauelshagen
In case md runs underneath the dm-raid target, the mddev does not have a request queue or gendisk, so avoid those accesses. This patch adds a missing conditional to the raid5 personality. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
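A minimal sketch of the kind of guard being added; the particular queue call is illustrative, not necessarily the one this patch touches:

    /* under dm-raid there is no request queue or gendisk attached to the
     * mddev, so any queue-limit or gendisk access must be conditional */
    if (mddev->queue)
            blk_queue_max_write_same_sectors(mddev->queue, 0);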
2016-04-29raid5: delete unnecessary warningShaohua Li
If a device has R5_LOCKED set, it is legitimate for it to also have R5_SkipCopy set and page != orig_page. After R5_LOCKED is cleared, handle_stripe_clean_event will clear the SkipCopy flag and set page back to orig_page. So the warning is unnecessary. Reported-by: Joey Liao <joeyliao@qnap.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-17md/raid5: Cleanup cpu hotplug notifierAnna-Maria Gleixner
The raid456_cpu_notify() hotplug callback lacks handling of the CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch buffer is leaked. Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions to free the scratch buffer. CC: Shaohua Li <shli@kernel.org> CC: linux-raid@vger.kernel.org Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Shaohua Li <shli@fb.com>
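A hedged sketch of the notifier after the change; the callback and free_scratch_buffer() are existing raid5.c names, and the allocation branch is abbreviated:

    static int raid456_cpu_notify(struct notifier_block *nfb, unsigned long action,
                                  void *hcpu)
    {
            struct r5conf *conf = container_of(nfb, struct r5conf, cpu_notify);
            long cpu = (long)hcpu;
            struct raid5_percpu *percpu = per_cpu_ptr(conf->percpu, cpu);

            switch (action) {
            case CPU_UP_PREPARE:
            case CPU_UP_PREPARE_FROZEN:
                    /* allocate the scratch buffer; may fail and return NOTIFY_BAD */
                    break;
            case CPU_UP_CANCELED:           /* new: UP_PREPARE failed, free again */
            case CPU_UP_CANCELED_FROZEN:
            case CPU_DEAD:
            case CPU_DEAD_FROZEN:
                    free_scratch_buffer(conf, percpu);
                    break;
            default:
                    break;
            }
            return NOTIFY_OK;
    }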
2016-03-09md/raid5: output stripe state for debugShaohua Li
Neil recently fixed an obscure race in break_stripe_batch_list. Debugging would be much more convenient if we knew the stripe state, which is what this patch provides. Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_listNeilBrown
break_stripe_batch_list breaks up a batch and copies some flags from the batch head to the members, preserving others. It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not normally a problem, as STRIPE_PREREAD_ACTIVE is cleared when a stripe_head is added to a batch and is not set on stripe_heads already in a batch. However there is no locking to ensure one thread doesn't set the flag just after it has been cleared in another. This does occasionally happen. md/raid5 maintains a count of the number of stripe_heads with STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently, this count becomes incorrect and will never again return to zero. md/raid5 delays the handling of some stripe_heads until preread_active_stripes becomes zero, so when the above-mentioned race happens, those stripe_heads become blocked and never progress, resulting in writes to the array hanging. So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE in the members of a batch. URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741 URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153 URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz Reported-by: Martin Svec <martin.svec@zoner.cz> (and others) Tested-by: Tom Weber <linux@junkyard.4t2.com> Fixes: 1b956f7a8f9a ("md/raid5: be more selective about distributing flags across batch.") Cc: stable@vger.kernel.org (v4.1 and later) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26RAID5: revert e9e4c377e2f563 to fix a livelockShaohua Li
Revert commit e9e4c377e2f563 (md/raid5: per hash value and exclusive wait_for_stripe). The problem is that raid5_get_active_stripe waits on conf->wait_for_stripe[hash]. Assume hash is 0. My test releases stripes in this order:
- release all stripes with hash 0
- raid5_get_active_stripe still sleeps since active_stripes > max_nr_stripes * 3 / 4
- release all stripes with hash other than 0. active_stripes becomes 0
- raid5_get_active_stripe still sleeps, since nobody wakes up wait_for_stripe[0]
The system livelocks. The problem is that active_stripes isn't a per-hash count. Reverting the patch makes the livelock go away. Cc: stable@vger.kernel.org (v4.2+) Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26RAID5: check_reshape() shouldn't call mddev_suspendShaohua Li
check_reshape() is called from the raid5d thread. The raid5d thread shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO to finish, but IO is handled in the raid5d thread; we could easily deadlock here. This issue was introduced by 738a273 ("md/raid5: fix allocation of 'scribble' array."). Cc: stable@vger.kernel.org (v4.1+) Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-25md/raid5: Compare apples to apples (or sectors to sectors)Jes Sorensen
'max_discard_sectors' is in sectors, while 'stripe' is in bytes. This fixes the problem where DISCARD would get disabled on some larger RAID5 configurations (6 or more drives in my testing), while it worked as expected with smaller configurations. Fixes: 620125f2bf8 ("MD: raid5 trim support") Cc: stable@vger.kernel.org v3.7+ Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
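The unit mismatch in one hedged snippet; the condition is simplified from raid5's discard setup, where stripe is the stripe size in bytes, so one side has to be shifted into sectors:

    /* stripe is in bytes, max_discard_sectors is in 512-byte sectors:
     * e.g. a 512KiB stripe needs max_discard_sectors >= 1024, not >= 524288 */
    if (mddev->queue->limits.max_discard_sectors >= (stripe >> 9) &&
        mddev->queue->limits.discard_granularity >= stripe)
            queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
    else
            queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);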
2016-01-20MD: rename some functionsShaohua Li
These short function names are hard to search. Rename them to make vim happy. Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-06raid5-cache: add journal hot add/remove supportShaohua Li
Add support for journal disk hot add/remove. Mostly trivial checks in the md part. The raid5 part is a little tricky. For hot-remove, we can't wait for pending writes as it's called from raid5d; the wait would cause a deadlock. We simply fail the hot-remove. A hot-remove retry can eventually succeed, since if the journal disk is faulty all pending writes will fail and finish. For hot-add, since an array supporting a journal but without a journal disk will be marked read-only, we are safe to hot add a journal without stopping IO (it should be read IO, while the journal only handles write IO). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06md/raid5: remove redundant check in stripe_add_to_batch_list()Roman Gushchin
The stripe_add_to_batch_list() function is called only if stripe_can_batch() returned true, so there is no need for double check. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Neil Brown <neilb@suse.com> Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-04Merge tag 'md/4.4' of git://neil.brown.name/mdLinus Torvalds
Pull md updates from Neil Brown: "Two major components to this update.
1) The clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state.
2) The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes."

* tag 'md/4.4' of git://neil.brown.name/md: (66 commits)
  MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
  MD: set journal disk ->raid_disk
  MD: kick out journal disk if it's not fresh
  raid5-cache: start raid5 readonly if journal is missing
  MD: add new bit to indicate raid array with journal
  raid5-cache: IO error handling
  raid5: journal disk can't be removed
  raid5-cache: add trim support for log
  MD: fix info output for journal disk
  raid5-cache: use bio chaining
  raid5-cache: small log->seq cleanup
  raid5-cache: new helper: r5_reserve_log_entry
  raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
  raid5-cache: take rdev->data_offset into account early on
  raid5-cache: refactor bio allocation
  raid5-cache: clean up r5l_get_meta
  raid5-cache: simplify state machine when caches flushes are not needed
  raid5-cache: factor out a helper to run all stripes for an I/O unit
  raid5-cache: rename flushed_ios to finished_ios
  raid5-cache: free I/O units earlier
  ...
2015-11-01MD: set journal disk ->raid_diskShaohua Li
Set the journal disk's ->raid_disk to >= 0. I chose raid_disks + 1 instead of 0, because we already have a disk with ->raid_disk 0 and this causes a sysfs entry creation conflict. A lot of places assume a disk with ->raid_disk >= 0 is a normal raid disk, so we add checks for the journal disk. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: start raid5 readonly if journal is missingShaohua Li
If the raid array is expected to have a journal (eg, a journal is set in the MD superblock feature map) and the array is started without a journal disk, start the array read-only. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: IO error handlingShaohua Li
There are 3 places where raid5-cache dispatches IO. Discard IO errors don't matter, so we ignore them. Superblock write IO errors can be handled in the MD core. The remaining ones are log writes and flushes. When such an IO error happens, we mark the log disk faulty and fail all write IO; read IO is still allowed to run. Userspace will get a notification too, and the corresponding daemon can choose to set the raid array read-only, for example. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5: journal disk can't be removedShaohua Li
raid5-cache uses the journal disk's rdev->bdev and rdev->mddev in several places. Don't allow the journal disk to disappear magically. On the other hand, we do need to update the superblock of the other disks to bump up ->events, so next time the journal disk will be identified as stale. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: move reclaim stop to quiesceShaohua Li
Move the reclaim stop to quiesce handling, which is a safer place for it. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: optimize FLUSH IO with log enabledShaohua Li
With the log enabled, a bio is written to the raid disks only after it has settled down in the log disk. Recovery guarantees we can recover the bio data from the log disk, so we can skip the FLUSH IO. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: switching to state machine for log disk cache flushShaohua Li
Before we write stripe data to the raid disks, we must guarantee the stripe data has settled down in the log disk. To do this, we flush the log disk cache and wait for the flush to finish. That wait introduces sleep time in the raid5d thread and impacts performance. This patch moves the log disk cache flush process into the stripe handling state machine, which removes the wait from raid5d. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5: enable log for raid array with cache diskShaohua Li
Now the log is safe to enable for a raid array with a cache disk. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5: don't allow resize/reshape with cache(log) supportShaohua Li
If cache (log) support is enabled, don't allow resize/reshape at the current stage. In the future, we can flush all data from the cache (log) to the raid array before resize/reshape and then allow resize/reshape. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5: disable batch with log enabledShaohua Li
With the log enabled, r5l_write_stripe adds the stripe to the log. With batching, several stripes are linked together and must be in the same state. But with the log, the log/reclaim unit is a single stripe, so we can't guarantee the batched stripes are in the same state. Disable batching when the log is enabled for now. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-31md/raid5: fix locking in handle_stripe_clean_event()Roman Gushchin
After commit 566c09c53455 ("raid5: relieve lock contention in get_active_stripe()") __find_stripe() is called under conf->hash_locks + hash. But handle_stripe_clean_event() calls remove_hash() under conf->device_lock. Under some circumstances the hash chain can become circular, and we get an infinite loop with disabled interrupts and a locked hash lock in __find_stripe(). This leads to a hard lockup on multiple CPUs and a subsequent system crash. I was able to reproduce this behavior on raid6 over 6 ssd disks. The devices_handle_discard_safely option should be set to enable trim support. The following script was used:
for i in `seq 1 32`; do
    dd if=/dev/zero of=large$i bs=10M count=100 &
done
neilb: the original was against a 3.x kernel. I forward-ported it to 4.3-rc. This version is suitable for any kernel since commit 59fc630b8b5f ("RAID5: batch adjacent full stripe write") (v4.1+). I'll post a version for earlier kernels to stable. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Fixes: 566c09c53455 ("raid5: relieve lock contention in get_active_stripe()") Signed-off-by: NeilBrown <neilb@suse.com> Cc: Shaohua Li <shli@kernel.org> Cc: <stable@vger.kernel.org> # 3.13 - 4.2
2015-10-24raid5: log reclaim supportShaohua Li
This is the reclaim support for the raid5 log. A stripe write has the following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io prepares to write data/parity to the raid disks
2. hijack ops_run_io. stripe data/parity is appended to the log disk
3. flush the log disk cache
4. ops_run_io runs again and does the normal operation. stripe data/parity is written to the raid array disks. The raid core can return the io to the upper layer.
5. flush the cache of all raid array disks
6. update the super block
7. log disk space used by the stripe can be reused
In practice, several stripes make up an io_unit and we batch several io_units in different steps, but the whole process doesn't change. It's possible to return the io just after data/parity hits the log disk, but then read IO would need to read from the log disk. For simplicity, IO return happens at step 4, where read IO can directly read from the raid disks. Currently reclaim runs if there is a specific amount of reclaimable space (1/4 of the disk size or 10G) or we are out of space. Reclaim just frees log disk space; it doesn't impact data consistency. The size-based forced reclaim is to make sure the log isn't too big, so recovery doesn't scan the log too much. Recovery makes sure the raid disks and log disk have the same data for a stripe. If a crash happens before step 4, recovery might or might not recover the stripe's data/parity depending on whether the data/parity and its checksum match. In either case, this doesn't change the semantics of an IO write. After step 3, the stripe is guaranteed recoverable, because the stripe's data/parity is persistent in the log disk. In some cases, the log disk content and the raid disks' content for a stripe are the same, but recovery will still copy the log disk content to the raid disks; this doesn't impact data consistency. Space reuse happens after the superblock update and cache flush. There is one situation we want to avoid: a broken meta block in the middle of the log keeps recovery from finding the meta blocks further ahead in the log. If an operation requires a meta block further ahead to be persistent in the log, we must make sure the meta blocks before it are persistent in the log too. The case is: stripe data/parity is in the log and we start writing the stripe to the raid disks (before step 4); the stripe data/parity must be persistent in the log before we do the write to the raid disks. The solution is that we strictly maintain io_unit list order. In this case, we only write stripes of an io_unit to the raid disks while that io_unit is the first one whose data/parity is in the log. The io_unit list order is important for other cases too. For example, some io_units are reclaimable and others are not. They can be mixed in the list; we shouldn't reuse space of an unreclaimable io_unit. Includes fixes to problems which were... Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24raid5: add basic stripe logShaohua Li
This introduces a simple log for raid5. Data/parity written to the raid array first goes to the log, then to the raid array disks. If a crash happens, we can recover the data from the log. This can speed up raid resync and fix the write hole issue. The log structure is pretty simple. Data/meta data is stored in block units, which are 4k generally. There is only one type of meta data block. The meta data block can track 3 types of data: stripe data, stripe parity and flush blocks. The MD superblock points to the last valid meta data block. Each meta data block has a checksum/seq number, so recovery can scan the log correctly. We store a checksum of the stripe data/parity in the metadata block, so meta data and stripe data/parity can be written to the log disk together; otherwise, the meta data write must wait till the stripe data/parity is finished. For stripe data, the meta data block records the stripe data sector and size. Currently the size is always 4k. This meta data record could be made simpler if we just fix the write hole (eg, we could record data of a stripe's different disks together), but this format can be extended to support caching in the future, which must record data address/size. For stripe parity, the meta data block records the stripe sector. Its size should be 4k (for raid5) or 8k (for raid6). We always store the P parity first. This format should work for caching too. A flush block indicates a stripe is in the raid array disks. Fixing the write hole doesn't need this type of meta data; it's for the caching extension. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
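A hedged sketch of the on-disk meta block this description implies; the field names approximate struct r5l_meta_block in raid5-cache.c and should not be treated as the authoritative layout:

    struct r5l_meta_block_sketch {
            __le32  magic;
            __le32  checksum;       /* covers the block, seeded with the log seq */
            __u8    version;
            __u8    __zero_pad_1;
            __le16  __zero_pad_2;
            __le32  meta_size;      /* bytes actually used in this 4k block */
            __le64  seq;            /* increasing sequence number for log scanning */
            __le64  position;       /* location of this block in the log device */
            /* followed by payload records describing stripe data, stripe parity
             * or flush entries, each with sector, size and per-page checksums */
    } __packed;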
2015-10-24raid5: add a new state for stripe log handlingShaohua Li
When a stripe finishes construction, we write the stripe to the raid array in ops_run_io normally. With the log, we do a bunch of other operations before the stripe is written to the raid array, mainly writing the stripe to the log disk, flushing the disk cache and so on. The operations are still driven by raid5d and run in the stripe state machine. We introduce a new state for such stripes (trapped in the log). A stripe is in this state from the time it first enters ops_run_io (finishes construction) to the time it is written to the raid array. Since we know the state is only for the log, we bypass other checks/operations in handle_stripe. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24raid5: export some functionsShaohua Li
The next several patches use some raid5 functions; rename them with a raid5 prefix and export them. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-12md-cluster: Use a small window for resyncGoldwyn Rodrigues
Suspending the entire device for resync could take too long, so resync in small chunks. The cluster's resync window (32M) is maintained in r1conf as cluster_sync_low and cluster_sync_high and processed in raid1's sync_request(). If the current resync is outside the cluster resync window:
1. Set cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window; if not, issue a wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it to their suspension list.
bitmap_cond_end_sync is modified to allow forcing a sync in order to get curr_resync_completed up to date with the sector passed. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-10-02md: drop null test before destroy functionsJulia Lawall
Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NeilBrown <neilb@suse.com>