summaryrefslogtreecommitdiffstats
path: root/drivers/md/dm-bufio.c
AgeCommit message (Collapse)Author
2020-06-05Merge tag 'for-5.8/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - The largest change for this cycle is the DM zoned target's metadata version 2 feature that adds support for pairing regular block devices with a zoned device to ease the performance impact associated with finite random zones of zoned device. The changes came in three batches: the first prepared for and then added the ability to pair a single regular block device, the second was a batch of fixes to improve zoned's reclaim heuristic, and the third removed the limitation of only adding a single additional regular block device to allow many devices. Testing has shown linear scaling as more devices are added. - Add new emulated block size (ebs) target that emulates a smaller logical_block_size than a block device supports The primary use-case is to emulate "512e" devices that have 512 byte logical_block_size and 4KB physical_block_size. This is useful to some legacy applications that otherwise wouldn't be able to be used on 4K devices because they depend on issuing IO in 512 byte granularity. - Add discard interfaces to DM bufio. First consumer of the interface is the dm-ebs target that makes heavy use of dm-bufio. - Fix DM crypt's block queue_limits stacking to not truncate logic_block_size. - Add Documentation for DM integrity's status line. - Switch DMDEBUG from a compile time config option to instead use dynamic debug via pr_debug. - Fix DM multipath target's hueristic for how it manages "queue_if_no_path" state internally. DM multipath now avoids disabling "queue_if_no_path" unless it is actually needed (e.g. in response to configure timeout or explicit "fail_if_no_path" message). This fixes reports of spurious -EIO being reported back to userspace application during fault tolerance testing with an NVMe backend. Added various dynamic DMDEBUG messages to assist with debugging queue_if_no_path in the future. - Add a new DM multipath "Historical Service Time" Path Selector. - Fix DM multipath's dm_blk_ioctl() to switch paths on IO error. - Improve DM writecache target performance by using explicit cache flushing for target's single-threaded usecase and a small cleanup to remove unnecessary test in persistent_memory_claim. - Other small cleanups in DM core, dm-persistent-data, and DM integrity. * tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits) dm crypt: avoid truncating the logical block size dm mpath: add DM device name to Failing/Reinstating path log messages dm mpath: enhance queue_if_no_path debugging dm mpath: restrict queue_if_no_path state machine dm mpath: simplify __must_push_back dm zoned: check superblock location dm zoned: prefer full zones for reclaim dm zoned: select reclaim zone based on device index dm zoned: allocate zone by device index dm zoned: support arbitrary number of devices dm zoned: move random and sequential zones into struct dmz_dev dm zoned: per-device reclaim dm zoned: add metadata pointer to struct dmz_dev dm zoned: add device pointer to struct dm_zone dm zoned: allocate temporary superblock for tertiary devices dm zoned: convert to xarray dm zoned: add a 'reserved' zone flag dm zoned: improve logging messages for reclaim dm zoned: avoid unnecessary device recalulation for secondary superblock dm zoned: add debugging message for reading superblocks ...
2020-06-05dm bufio: introduce forget_buffer_lockedMikulas Patocka
Introduce a function forget_buffer_locked that forgets a range of buffers. It is more efficient than calling forget_buffer in a loop. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05dm bufio: clean up rbtree block orderingMikulas Patocka
dm-bufio uses unnatural ordering in the rb-tree - blocks with smaller numbers were put to the right node and blocks with bigger numbers were put to the left node. Reverse that logic so that it's natural. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-04dm bufio: delete unused and inefficient dm_bufio_discard_buffersMikulas Patocka
There is no user for this interface. If in future it is needed it can be reimplemented to walk the rbtree of buffers instead of doing block-by-block lookups. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-02mm: remove the pgprot argument to __vmallocChristoph Hellwig
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv] Acked-by: Gao Xiang <xiang@kernel.org> [erofs] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Wei Liu <wei.liu@kernel.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-15dm bufio: implement discardMikulas Patocka
Add functions dm_bufio_issue_discard and dm_bufio_discard_buffers. dm_bufio_issue_discard sends discard request to the underlying device. dm_bufio_discard_buffers frees buffers in the range and then calls dm_bufio_issue_discard. Also, factor out block_to_sector for reuse in dm_bufio_issue_discard. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13dm bufio: introduce a global cache replacementMikulas Patocka
This commit introduces a global cache replacement (instead of per-client cleanup). If one bufio client uses the cache heavily and another client is not using it, we want to let the first client use most of the cache. The old algorithm would partition the cache equally betwen the clients and that is sub-optimal. For cache replacement, we use the clock algorithm because it doesn't require taking any lock when the buffer is accessed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13dm bufio: remove old-style buffer cleanupMikulas Patocka
Remove code that cleans up buffers if the cache size grows over the limit. The next commit will introduce a new global cleanup. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13dm bufio: introduce a global queueMikulas Patocka
Rename param_spinlock to global_spinlock and introduce a global queue of all used buffers. The queue will be used in the following commits. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13dm bufio: refactor adjust_total_allocatedMikulas Patocka
Refactor adjust_total_allocated() so that it takes a bool argument indicating if it should add or subtract the buffer size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13dm bufio: call adjust_total_allocated from __link_buffer and __unlink_bufferMikulas Patocka
Move the call to adjust_total_allocated() to __link_buffer() and __unlink_buffer() so that only used buffers are counted. Reserved buffers are not. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15Revert "dm bufio: fix deadlock with loop device"Mikulas Patocka
Revert the commit bd293d071ffe65e645b4d8104f9d8fe15ea13862. The proper fix has been made available with commit d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread"). Note that the fix offered by commit bd293d071ffe doesn't really prevent the deadlock from occuring - if we look at the stacktrace reported by Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex - i.e. it has already successfully taken the mutex. Changing the mutex from mutex_lock to mutex_trylock won't help with deadlocks that happen afterwards. PID: 474 TASK: ffff8813e11f4600 CPU: 10 COMMAND: "kswapd0" #0 [ffff8813dedfb938] __schedule at ffffffff8173f405 #1 [ffff8813dedfb990] schedule at ffffffff8173fa27 #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186 #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8 #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81 #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio] #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio] #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio] #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778 #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f #13 [ffff8813dedfbec0] kthread at ffffffff810a8428 #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: bd293d071ffe ("dm bufio: fix deadlock with loop device") Depends-on: d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-12dm bufio: fix deadlock with loop deviceJunxiao Bi
When thin-volume is built on loop device, if available memory is low, the following deadlock can be triggered: One process P1 allocates memory with GFP_FS flag, direct alloc fails, memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan() runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer IO to complete in __try_evict_buffer(). But this IO may never complete if issued to an underlying loop device that forwards it using direct-IO, which allocates memory using GFP_KERNEL (see: do_blockdev_direct_IO()). If allocation fails, memory reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan() will be invoked, and since the mutex is already held by P1 the loop thread will hang, and IO will never complete. Resulting in ABBA deadlock. Cc: stable@vger.kernel.org Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-29dm bufio: Simplify stack trace retrievalThomas Gleixner
Replace the indirection through struct stack_trace with an invocation of the storage array based interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: dm-devel@redhat.com Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: linux-mm@kvack.org Cc: David Rientjes <rientjes@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: iommu@lists.linux-foundation.org Cc: Robin Murphy <robin.murphy@arm.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: linux-btrfs@vger.kernel.org Cc: Daniel Vetter <daniel@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: linux-arch@vger.kernel.org Link: https://lkml.kernel.org/r/20190425094802.446326191@linutronix.de
2018-12-28Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: - large KASAN update to use arm's "software tag-based mode" - a few misc things - sh updates - ocfs2 updates - just about all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits) kernel/fork.c: mark 'stack_vm_area' with __maybe_unused memcg, oom: notify on oom killer invocation from the charge path mm, swap: fix swapoff with KSM pages include/linux/gfp.h: fix typo mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization memory_hotplug: add missing newlines to debugging output mm: remove __hugepage_set_anon_rmap() include/linux/vmstat.h: remove unused page state adjustment macro mm/page_alloc.c: allow error injection mm: migrate: drop unused argument of migrate_page_move_mapping() blkdev: avoid migration stalls for blkdev pages mm: migrate: provide buffer_migrate_page_norefs() mm: migrate: move migrate_page_lock_buffers() mm: migrate: lock buffers before migrate_page_move_mapping() mm: migration: factor out code to compute expected number of page references mm, page_alloc: enable pcpu_drain with zone capability kmemleak: add config to select auto scan mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init ...
2018-12-28mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-18dm bufio: update comment in dm-bufio.cShenghui Wang
* Hashtable has been replaced by rbtree to manage buffers. Update the comment. * Fix typo in the comment for dm_bufio_issue_flush Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30dm bufio: fix buffer alignmentMikulas Patocka
Commit 6b5e718cc138 ("dm bufio: relax alignment constraint on slab cache") relaxed alignment on dm-bufio cache, however it may break dm-crypt or dm-integrity. dm-crypt and dm-integrity require that the size of bio vector entries (bv_len) is aligned on its sector size. bv_offset doesn't have to be aligned, but bv_len must be. XFS sends unaligned bios, but they do not cross page boundary, so the requirement for aligned bv_len is met. Commit 6b5e718cc138 made dm-bufio send unaligned bios that cross page boundary, this could break dm-crypt and dm-integrity. Reinstates the alignment. Note that misaligned entries only happen when we use slab/slub debugging. Without debugging, the entries are always aligned. Fixes: 6b5e718cc138 ("dm bufio: relax alignment constraint on slab cache") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: don't embed a bio in the dm_buffer structureMikulas Patocka
The bio structure consumes a substantial part of dm_buffer. The bio structure is only needed when doing I/O on the buffer, thus we don't have to embed it in the buffer. Allocate the bio structure only when doing I/O. We don't need to create a bio_set because, in case of allocation failure, dm-bufio falls back to using dm-io (which keeps its own bio_set). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: support non-power-of-two block sizesMikulas Patocka
Support block sizes that are not a power-of-two (but they must be a multiple of 512b). As always, a slab cache is used for allocations. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: use slab cache for dm_buffer structure allocationsMikulas Patocka
kmalloc padded to the next power of two, using a slab cache avoids this. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: reorder fields in dm_buffer structureMikulas Patocka
Reorder fields in dm_buffer structure to improve packing and reduce structure size. The compiler allocates 32-bit integer for field 'enum data_mode', so change it to unsigned char. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: relax alignment constraint on slab cacheMikulas Patocka
The I/O buffer doesn't have to be aligned on block size granularity, relax alignment to ARCH_KMALLOC_MINALIGN (required to allow DMA from slab cache memory on some architectures). Also, set SLAB_RECLAIM_ACCOUNT so that the memory allocated from the cache is accounted as reclaimable and doesn't inflate the 'used' entry in the free command. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: remove code that merges slab cachesMikulas Patocka
All slab allocators can merge duplicate caches. So dm-bufio doesn't need extra slab merging logic. Instead it can just allocate one slab cache per client and let the allocator merge them. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: get rid of slab cache name allocationsMikulas Patocka
dm-bufio keeps the dm_bufio_cache_names array that holds names of the slab caches. Since the commit db265eca7700 ("mm/sl[aou]b: Move duping of slab name to slab_common.c"), the kernel automatically duplicates the slab cache name when creating the slab cache, so we no longer have to keep the name allocated. Remove the code that allocates the slab names and keeps them around. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: move dm-bufio.h to include/linux/Mikulas Patocka
Move dm-bufio.h to include/linux/ so that external GPL'd DM target modules can use it. It is better to allow the use of dm-bufio than force external modules to implement the equivalent buffered IO mechanism in some new way. The hope is this will encourage the use of dm-bufio; which will then make it easier for a GPL'd external DM target module to be included upstream. A couple dm-bufio EXPORT_SYMBOL exports have also been updated to use EXPORT_SYMBOL_GPL. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03dm bufio: delete outdated commentMikulas Patocka
This comment was true when dm-bufio was written but, since 4.3, bios can now have arbitrary size and the driver splits them. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-06dm bufio: avoid false-positive Wmaybe-uninitialized warningArnd Bergmann
gcc-6.3 and earlier show a new warning after a seemingly unrelated change to the arm64 PAGE_KERNEL definition: In file included from drivers/md/dm-bufio.c:14:0: drivers/md/dm-bufio.c: In function 'alloc_buffer': include/linux/sched/mm.h:182:56: warning: 'noio_flag' may be used uninitialized in this function [-Wmaybe-uninitialized] current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags; ^ The same warning happened earlier on linux-3.18 for MIPS and I did a workaround for that, but now it's come back. gcc-7 and newer are apparently smart enough to figure this out, and other architectures don't show it, so the best I could come up with is to rework the caller slightly in a way that makes it obvious enough to all arm64 compilers what is happening here. Fixes: 41acec624087 ("arm64: kpti: Make use of nG dependent on arm64_kernel_unmapped_at_el0()") Link: https://patchwork.kernel.org/patch/9692829/ Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> [snitzer: moved declarations inside conditional, altered vmalloc return] Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17dm bufio: eliminate unnecessary labels in dm_bufio_client_create()Mike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17dm bufio: check result of register_shrinker()Aliaksei Karaliou
dm_bufio_client_create() does not check result of register_shrinker() which was tagged as __must_check recently, reported by sparse. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17dm bufio: add missed destroys of client mutexAliaksei Karaliou
The client's mutex needs to be destroyed in dm_bufio_client_destroy() as well as the dm_bufio_client_create() error path. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17dm bufio: use REQ_OP_READ and REQ_OP_WRITEMikulas Patocka
Use REQ_OP_READ and REQ_OP_WRITE macros instead of READ and WRITE. They have the same value, but the block layer uses REQ_OP so bufio should too. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-08dm bufio: fix shrinker scans when (nr_to_scan < retain_target)Suren Baghdasaryan
When system is under memory pressure it is observed that dm bufio shrinker often reclaims only one buffer per scan. This change fixes the following two issues in dm bufio shrinker that cause this behavior: 1. ((nr_to_scan - freed) <= retain_target) condition is used to terminate slab scan process. This assumes that nr_to_scan is equal to the LRU size, which might not be correct because do_shrink_slab() in vmscan.c calculates nr_to_scan using multiple inputs. As a result when nr_to_scan is less than retain_target (64) the scan will terminate after the first iteration, effectively reclaiming one buffer per scan and making scans very inefficient. This hurts vmscan performance especially because mutex is acquired/released every time dm_bufio_shrink_scan() is called. New implementation uses ((LRU size - freed) <= retain_target) condition for scan termination. LRU size can be safely determined inside __scan() because this function is called after dm_bufio_lock(). 2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to determine number of freeable objects in the slab. However dm_bufio always retains retain_target buffers in its LRU and will terminate a scan when this mark is reached. Therefore returning the entire LRU size from dm_bufio_shrink_count() is misleading because that does not represent the number of freeable objects that slab will reclaim during a scan. Returning (LRU size - retain_target) better represents the number of freeable objects in the slab. This way do_shrink_slab() returns 0 when (LRU size < retain_target) and vmscan will not try to scan this shrinker avoiding scans that will not reclaim any memory. Test: tested using Android device running <AOSP>/system/extras/alloc-stress that generates memory pressure and causes intensive shrinker scans Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-11-17Merge tag 'for-4.15/dm-changes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull more device mapper updates from Mike Snitzer: "Given your expected travel I figured I'd get these fixes to you sooner rather than later. - a DM multipath stable@ fix to silence an annoying error message that isn't _really_ an error - a DM core @stable fix for discard support that was enabled for an entire DM device despite only having partial support for discards due to a mix of discard capabilities across the underlying devices. - a couple other DM core discard fixes. - a DM bufio @stable fix that resolves a 32-bit overflow" * tag 'for-4.15/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm bufio: fix integer overflow when limiting maximum cache size dm: clear all discard attributes in queue_limits when discards are disabled dm: do not set 'discards_supported' in targets that do not need it dm: discard support requires all targets in a table support discards dm mpath: remove annoying message of 'blk_get_request() returned -11'
2017-11-16dm bufio: fix integer overflow when limiting maximum cache sizeEric Biggers
The default max_cache_size_bytes for dm-bufio is meant to be the lesser of 25% of the size of the vmalloc area and 2% of the size of lowmem. However, on 32-bit systems the intermediate result in the expression (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100 overflows, causing the wrong result to be computed. For example, on a 32-bit system where the vmalloc area is 520093696 bytes, the result is 1174405 rather than the expected 130023424, which makes the maximum cache size much too small (far less than 2% of lowmem). This causes severe performance problems for dm-verity users on affected systems. Fix this by using mult_frac() to correctly multiply by a percentage. Do this for all places in dm-bufio that multiply by a percentage. Also replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary to the comment is now defined in include/linux/vmalloc.h. Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset") Fixes: 95d402f057f2 ("dm: add bufio") Cc: <stable@vger.kernel.org> # v3.2+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-10-25locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland
to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-14Merge tag 'for-4.14/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Some request-based DM core and DM multipath fixes and cleanups - Constify a few variables in DM core and DM integrity - Add bufio optimization and checksum failure accounting to DM integrity - Fix DM integrity to avoid checking integrity of failed reads - Fix DM integrity to use init_completion - A couple DM log-writes target fixes - Simplify DAX flushing by eliminating the unnecessary flush abstraction that was stood up for DM's use. * tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dax: remove the pmem_dax_ops->flush abstraction dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK dm integrity: make blk_integrity_profile structure const dm integrity: do not check integrity for failed read operations dm log writes: fix >512b sectorsize support dm log writes: don't use all the cpu while waiting to log blocks dm ioctl: constify ioctl lookup table dm: constify argument arrays dm integrity: count and display checksum failures dm integrity: optimize writing dm-bufio buffers that are partially changed dm rq: do not update rq partially in each ending bio dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior dm mpath: complain about unsupported __multipath_map_bio() return values dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
2017-08-28dm integrity: optimize writing dm-bufio buffers that are partially changedMikulas Patocka
Rather than write the entire dm-bufio buffer when only a subset is changed, improve dm-bufio (and dm-integrity) by only writing the subset of the buffer that changed. Update dm-integrity to make use of dm-bufio's new dm_bufio_mark_partial_buffer_dirty() interface. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-25dm bufio: fix error code in dm_bufio_write_dirty_buffers()Dan Carpenter
We should be returning normal negative error codes here. The "a" variables comes from &c->async_write_error which is a blk_status_t converted to a regular error code. In the current code, the blk_status_t gets propogated back to pool_create() and eventually results in an Oops. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-06-12Merge tag 'v4.12-rc5' into for-4.13/blockJens Axboe
We've already got a few conflicts and upcoming work depends on some of the changes that have gone into mainline as regression fixes for this series. Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream trees to continue working on 4.13 changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-09block: switch bios to blk_status_tChristoph Hellwig
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-31dm: make flush bios explicitly syncJan Kara
Commit b685d3d65ac7 ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous") removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions. Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. Fixes: b685d3d65ac7 ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous") Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-16dm bufio: make the parameter "retain_bytes" unsigned longMikulas Patocka
Change the type of the parameter "retain_bytes" from unsigned to unsigned long, so that on 64-bit machines the user can set more than 4GiB of data to be retained. Also, change the type of the variable "count" in the function "__evict_old_buffers" to unsigned long. The assignment "count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];" could result in unsigned long to unsigned overflow and that could result in buffers not being freed when they should. While at it, avoid division in get_retain_buffers(). Division is slow, we can change it to shift because we have precalculated the log2 of block size. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-08mm, vmalloc: use __GFP_HIGHMEM implicitlyMichal Hocko
__vmalloc* allows users to provide gfp flags for the underlying allocation. This API is quite popular $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l 77 The only problem is that many people are not aware that they really want to give __GFP_HIGHMEM along with other flags because there is really no reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages which are mapped to the kernel vmalloc space. About half of users don't use this flag, though. This signals that we make the API unnecessarily too complex. This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM are simplified and drop the flag. Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Cristopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-01dm bufio: check new buffer allocation watermark every 30 secondsMikulas Patocka
dm-bufio checks a watermark when it allocates a new buffer in __bufio_new(). However, it doesn't check the watermark when the user changes /sys/module/dm_bufio/parameters/max_cache_size_bytes. This may result in a problem - if the watermark is high enough so that all possible buffers are allocated and if the user lowers the value of "max_cache_size_bytes", the watermark will never be checked against the new value because no new buffer would be allocated. To fix this, change __evict_old_buffers() so that it checks the watermark. __evict_old_buffers() is called every 30 seconds, so if the user reduces "max_cache_size_bytes", dm-bufio will react to this change within 30 seconds and decrease memory consumption. Depends-on: 1b0fb5a5b2 ("dm bufio: avoid a possible ABBA deadlock") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-01dm bufio: avoid a possible ABBA deadlockMikulas Patocka
__get_memory_limit() tests if dm_bufio_cache_size changed and calls __cache_size_refresh() if it did. It takes dm_bufio_clients_lock while it already holds the client lock. However, lock ordering is violated because in cleanup_old_buffers() dm_bufio_clients_lock is taken before the client lock. This results in a possible deadlock and lockdep engine warning. Fix this deadlock by changing mutex_lock() to mutex_trylock(). If the lock can't be taken, it will be re-checked next time when a new buffer is allocated. Also add "unlikely" to the if condition, so that the optimizer assumes that the condition is false. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-07dm bufio: add sector start offset to dm-bufio interfaceMikulas Patocka
Introduce dm_bufio_set_sector_offset() interface to allow setting a sector offset for a dm-bufio client. This is a prereq for the DM integrity target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-02sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h>Ingo Molnar
Update the .c files that depend on these APIs. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Remove set_task_state()Davidlohr Bueso
This is a nasty interface and setting the state of a foreign task must not be done. As of the following commit: be628be0956 ("bcache: Make gc wakeup sane, remove set_task_state()") ... everyone in the kernel calls set_task_state() with current, allowing the helper to be removed. However, as the comment indicates, it is still around for those archs where computing current is more expensive than using a pointer, at least in theory. An important arch that is affected is arm64, however this has been addressed now [1] and performance is up to par making no difference with either calls. Of all the callers, if any, it's the locking bits that would care most about this -- ie: we end up passing a tsk pointer to a lot of the lock slowpath, and setting ->state on that. The following numbers are based on two tests: a custom ad-hoc microbenchmark that just measures latencies (for ~65 million calls) between get_task_state() vs get_current_state(). Secondly for a higher overview, an unlink microbenchmark was used, which pounds on a single file with open, close,unlink combos with increasing thread counts (up to 4x ncpus). While the workload is quite unrealistic, it does contend a lot on the inode mutex or now rwsem. [1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com == 1. x86-64 == Avg runtime set_task_state(): 601 msecs Avg runtime set_current_state(): 552 msecs vanilla dirty Hmean unlink1-processes-2 36089.26 ( 0.00%) 38977.33 ( 8.00%) Hmean unlink1-processes-5 28555.01 ( 0.00%) 29832.55 ( 4.28%) Hmean unlink1-processes-8 37323.75 ( 0.00%) 44974.57 ( 20.50%) Hmean unlink1-processes-12 43571.88 ( 0.00%) 44283.01 ( 1.63%) Hmean unlink1-processes-21 34431.52 ( 0.00%) 38284.45 ( 11.19%) Hmean unlink1-processes-30 34813.26 ( 0.00%) 37975.17 ( 9.08%) Hmean unlink1-processes-48 37048.90 ( 0.00%) 39862.78 ( 7.59%) Hmean unlink1-processes-79 35630.01 ( 0.00%) 36855.30 ( 3.44%) Hmean unlink1-processes-110 36115.85 ( 0.00%) 39843.91 ( 10.32%) Hmean unlink1-processes-141 32546.96 ( 0.00%) 35418.52 ( 8.82%) Hmean unlink1-processes-172 34674.79 ( 0.00%) 36899.21 ( 6.42%) Hmean unlink1-processes-203 37303.11 ( 0.00%) 36393.04 ( -2.44%) Hmean unlink1-processes-224 35712.13 ( 0.00%) 36685.96 ( 2.73%) == 2. ppc64le == Avg runtime set_task_state(): 938 msecs Avg runtime set_current_state: 940 msecs vanilla dirty Hmean unlink1-processes-2 19269.19 ( 0.00%) 30704.50 ( 59.35%) Hmean unlink1-processes-5 20106.15 ( 0.00%) 21804.15 ( 8.45%) Hmean unlink1-processes-8 17496.97 ( 0.00%) 17243.28 ( -1.45%) Hmean unlink1-processes-12 14224.15 ( 0.00%) 17240.21 ( 21.20%) Hmean unlink1-processes-21 14155.66 ( 0.00%) 15681.23 ( 10.78%) Hmean unlink1-processes-30 14450.70 ( 0.00%) 15995.83 ( 10.69%) Hmean unlink1-processes-48 16945.57 ( 0.00%) 16370.42 ( -3.39%) Hmean unlink1-processes-79 15788.39 ( 0.00%) 14639.27 ( -7.28%) Hmean unlink1-processes-110 14268.48 ( 0.00%) 14377.40 ( 0.76%) Hmean unlink1-processes-141 14023.65 ( 0.00%) 16271.69 ( 16.03%) Hmean unlink1-processes-172 13417.62 ( 0.00%) 16067.55 ( 19.75%) Hmean unlink1-processes-203 15293.08 ( 0.00%) 15440.40 ( 0.96%) Hmean unlink1-processes-234 13719.32 ( 0.00%) 16190.74 ( 18.01%) Hmean unlink1-processes-265 16400.97 ( 0.00%) 16115.22 ( -1.74%) Hmean unlink1-processes-296 14388.60 ( 0.00%) 16216.13 ( 12.70%) Hmean unlink1-processes-320 15771.85 ( 0.00%) 15905.96 ( 0.85%) x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching) and ppc64 (with paca) show similar improvements in the unlink microbenches. The small delta for ppc64 (2ms), does not represent the gains on the unlink runs. In the case of x86, there was a decent amount of variation in the latency runs, but always within a 20 to 50ms increase), ppc was more constant. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: mark.rutland@arm.com Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>