aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md/dm-crypt.c
AgeCommit message (Collapse)Author
2021-09-22dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()Arne Welzel
commit 528b16bfc3ae5f11638e71b3b63a81f9999df727 upstream. On systems with many cores using dm-crypt, heavy spinlock contention in percpu_counter_compare() can be observed when the page allocation limit for a given device is reached or close to be reached. This is due to percpu_counter_compare() taking a spinlock to compute an exact result on potentially many CPUs at the same time. Switch to non-exact comparison of allocated and allowed pages by using the value returned by percpu_counter_read_positive() to avoid taking the percpu_counter spinlock. This may over/under estimate the actual number of allocated pages by at most (batch-1) * num_online_cpus(). Currently, batch is bounded by 32. The system on which this issue was first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt allocations, this seems an acceptable error. Certainly preferred over running into the spinlock contention. This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs and 192GB RAM as follows, but can be provoked on systems with less CPUs as well. * Disable swap * Tune vm settings to promote regular writeback $ echo 50 > /proc/sys/vm/dirty_expire_centisecs $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes * Create 8 dmcrypt devices based on files on a tmpfs * Create and mount an ext4 filesystem on each crypt devices * Run stress-ng --hdd 8 within one of above filesystems Total %system usage collected from sysstat goes to ~35%. Write throughput on the underlying loop device is ~2GB/s. perf profiling an individual kworker kcryptd thread shows the following profile, indicating spinlock contention in percpu_counter_compare(): 99.98% 0.00% kworker/u193:46 [kernel.kallsyms] [k] ret_from_fork | --ret_from_fork kthread worker_thread | --99.92%--process_one_work | |--80.52%--kcryptd_crypt | | | |--62.58%--mempool_alloc | | | | | --62.24%--crypt_page_alloc | | | | | --61.51%--__percpu_counter_compare | | | | | --61.34%--__percpu_counter_sum | | | | | |--58.68%--_raw_spin_lock_irqsave | | | | | | | --58.30%--native_queued_spin_lock_slowpath | | | | | --0.69%--cpumask_next | | | | | --0.51%--_find_next_bit | | | |--10.61%--crypt_convert | | | | | |--6.05%--xts_crypt ... After applying this patch and running the same test, %system usage is lowered to ~7% and write throughput on the loop device increases to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62% in the profile and not hitting the percpu_counter() spinlock anymore. |--8.15%--mempool_alloc | | | |--3.93%--crypt_page_alloc | | | | | --3.75%--__alloc_pages | | | | | --3.62%--get_page_from_freelist | | | | | --3.22%--rmqueue_bulk | | | | | --2.59%--_raw_spin_lock | | | | | --2.57%--native_queued_spin_lock_slowpath | | | --3.05%--_raw_spin_lock_irqsave | | | --2.49%--native_queued_spin_lock_slowpath Suggested-by: DJ Gregor <dj@corelight.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Arne Welzel <arne.welzel@corelight.com> Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-20dm crypt: avoid truncating the logical block sizeEric Biggers
commit 64611a15ca9da91ff532982429c44686f4593b5f upstream. queue_limits::logical_block_size got changed from unsigned short to unsigned int, but it was forgotten to update crypt_io_hints() to use the new type. Fix it. Fixes: ad6bf88a6c19 ("block: fix an integer overflow in logical block size") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14dm crypt: fix benbi IV constructor crash if used in authenticated modeMilan Broz
commit 4ea9471fbd1addb25a4d269991dc724e200ca5b5 upstream. If benbi IV is used in AEAD construction, for example: cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256 the constructor uses wrong skcipher function and crashes: BUG: kernel NULL pointer dereference, address: 00000014 ... EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt] Call Trace: ? crypt_subkey_size+0x20/0x20 [dm_crypt] crypt_ctr+0x567/0xfc0 [dm_crypt] dm_table_add_target+0x15f/0x340 [dm_mod] Fix this by properly using crypt_aead_blocksize() in this case. Fixes: ef43aa38063a6 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)") Cc: stable@vger.kernel.org # v4.12+ Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051 Reported-by: Jerad Simpson <jbsimpson@gmail.com> Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-20dm: disable CRYPTO_TFM_REQ_MAY_SLEEP to fix a GFP_KERNEL recursion deadlockMikulas Patocka
[ Upstream commit 432061b3da64e488be3403124a72a9250bbe96d4 ] There's a XFS on dm-crypt deadlock, recursing back to itself due to the crypto subsystems use of GFP_KERNEL, reported here: https://bugzilla.kernel.org/show_bug.cgi?id=200835 * dm-crypt calls crypt_convert in xts mode * init_crypt from xts.c calls kmalloc(GFP_KERNEL) * kmalloc(GFP_KERNEL) recurses into the XFS filesystem, the filesystem tries to submit some bios and wait for them, causing a deadlock Fix this by updating both the DM crypt and integrity targets to no longer use the CRYPTO_TFM_REQ_MAY_SLEEP flag, which will change the crypto allocations from GFP_KERNEL to GFP_ATOMIC, therefore they can't recurse into a filesystem. A GFP_ATOMIC allocation can fail, but init_crypt() in xts.c handles the allocation failure gracefully - it will fall back to preallocated buffer if the allocation fails. The crypto API maintainer says that the crypto API only needs to allocate memory when dealing with unaligned buffers and therefore turning CRYPTO_TFM_REQ_MAY_SLEEP off is safe (see this discussion: https://www.redhat.com/archives/dm-devel/2018-August/msg00195.html ) Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
2019-02-20dm crypt: don't overallocate the integrity tag spaceMikulas Patocka
commit ff0c129d3b5ecb3df7c8f5e2236582bf745b6c5f upstream. bio_sectors() returns the value in the units of 512-byte sectors (no matter what the real sector size of the device). dm-crypt multiplies bio_sectors() by on_disk_tag_size to calculate the space allocated for integrity tags. If dm-crypt is running with sector size larger than 512b, it allocates more data than is needed. Device Mapper trims the extra space when passing the bio to dm-integrity, so this bug didn't result in any visible misbehavior. But it must be fixed to avoid wasteful memory allocation for the block integrity payload. Fixes: ef43aa38063a6 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)") Cc: stable@vger.kernel.org # 4.12+ Reported-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31dm crypt: fix parsing of extended IV argumentsMilan Broz
commit 1856b9f7bcc8e9bdcccc360aabb56fbd4dd6c565 upstream. The dm-crypt cipher specification in a mapping table is defined as: cipher[:keycount]-chainmode-ivmode[:ivopts] or (new crypt API format): capi:cipher_api_spec-ivmode[:ivopts] For ESSIV, the parameter includes hash specification, for example: aes-cbc-essiv:sha256 The implementation expected that additional IV option to never include another dash '-' character. But, with SHA3, there are names like sha3-256; so the mapping table parser fails: dmsetup create test --table "0 8 crypt aes-cbc-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0" or (new crypt API format) dmsetup create test --table "0 8 crypt capi:cbc(aes)-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0" device-mapper: crypt: Ignoring unexpected additional cipher options device-mapper: table: 253:0: crypt: Error creating IV device-mapper: ioctl: error adding target to table Fix the dm-crypt constructor to ignore additional dash in IV options and also remove a bogus warning (that is ignored anyway). Cc: stable@vger.kernel.org # 4.12+ Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-26dm crypt: use u64 instead of sector_t to store iv_offsetAliOS system security
[ Upstream commit 8d683dcd65c037efc9fb38c696ec9b65b306e573 ] The iv_offset in the mapping table of crypt target is a 64bit number when IV algorithm is plain64, plain64be, essiv or benbi. It will be assigned to iv_offset of struct crypt_config, cc_sector of struct convert_context and iv_sector of struct dm_crypt_request. These structures members are defined as a sector_t. But sector_t is 32bit when CONFIG_LBDAF is not set in 32bit kernel. In this situation sector_t is not big enough to store the 64bit iv_offset. Here is a reproducer. Prepare test image and device (loop is automatically allocated by cryptsetup): # dd if=/dev/zero of=tst.img bs=1M count=1 # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \ --skip 500000000000000000 tst.img test On 32bit system (use IV offset value that overflows to 64bit; CONFIG_LBDAF if off) and device checksum is wrong: # dmsetup table test --showkeys 0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0 # sha256sum /dev/mapper/test 533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test On 64bit system (and on 32bit system with the patch), table and checksum is now correct: # dmsetup table test --showkeys 0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0 # sha256sum /dev/mapper/test 5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test Signed-off-by: AliOS system security <alios_sys_security@linux.alibaba.com> Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-09-09dm crypt: don't decrease device limitsMikulas Patocka
commit bc9e9cf0401f18e33b78d4c8a518661b8346baf7 upstream. dm-crypt should only increase device limits, it should not decrease them. This fixes a bug where the user could creates a crypt device with 1024 sector size on the top of scsi device that had 4096 logical block size. The limit 4096 would be lost and the user could incorrectly send 1024-I/Os to the crypt device. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24dm crypt: limit the number of allocated pagesMikulas Patocka
commit 5059353df86e2573ccd9d43fd9d9396dcec47ca2 upstream. dm-crypt consumes an excessive amount memory when the user attempts to zero a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout, __blkdev_issue_zeroout sends a large amount of write bios that contain the zero page as their payload. For each incoming page, dm-crypt allocates another page that holds the encrypted data, so when processing "blkdiscard -z", dm-crypt tries to allocate the amount of memory that is equal to the size of the device. This can trigger OOM killer or cause system crash. Fix this by limiting the amount of memory that dm-crypt allocates to 2% of total system memory. This limit is system-wide and is divided by the number of active dm-crypt devices and each device receives an equal share. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm crypt: fix error return code in crypt_ctr()Wei Yongjun
commit 3cc2e57c4beabcbbaa46e1ac6d77ca8276a4a42d upstream. Fix to return error code -ENOMEM from the mempool_create_kmalloc_pool() error handling case instead of 0, as done elsewhere in this function. Fixes: ef43aa38063a6 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm crypt: wipe kernel key copy after IV initializationOndrej Kozina
commit dc94902bde1e158cd19c4deab208e5d6eb382a44 upstream. Loading key via kernel keyring service erases the internal key copy immediately after we pass it in crypto layer. This is wrong because IV is initialized later and we use wrong key for the initialization (instead of real key there's just zeroed block). The bug may cause data corruption if key is loaded via kernel keyring service first and later same crypt device is reactivated using exactly same key in hexbyte representation, or vice versa. The bug (and fix) affects only ciphers using following IVs: essiv, lmk and tcw. Fixes: c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service") Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm crypt: fix crash by adding missing check for auth key sizeMilan Broz
commit 27c7003697fc2c78f965984aa224ef26cd6b2949 upstream. If dm-crypt uses authenticated mode with separate MAC, there are two concatenated part of the key structure - key(s) for encryption and authentication key. Add a missing check for authenticated key length. If this key length is smaller than actually provided key, dm-crypt now properly fails instead of crashing. Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)") Reported-by: Salah Coronya <salahx@yahoo.com> Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30dm crypt: allow unaligned bv_offsetMikulas Patocka
commit 0440d5c0ca9744b92a07aeb6df0a9a75db6f4280 upstream. When slub_debug is enabled kmalloc returns unaligned memory. XFS uses this unaligned memory for its buffers (if an unaligned buffer crosses a page, XFS frees it and allocates a full page instead - see the function xfs_buf_allocate_memory). dm-crypt checks if bv_offset is aligned on page size and these checks fail with slub_debug and XFS. Fix this bug by removing the bv_offset checks. Switch to checking if bv_len is aligned instead of bv_offset (this check should be sufficient to prevent overruns if a bio with too small bv_len is received). Fixes: 8f0009a22517 ("dm crypt: optionally support larger encryption sector size") Reported-by: Bruno Prémont <bonbons@sysophe.eu> Tested-by: Bruno Prémont <bonbons@sysophe.eu> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-04dm crypt: reject sector_size feature if device length is not aligned to itMilan Broz
If a crypt mapping uses optional sector_size feature, additional restrictions to mapped device segment size must be applied in constructor, otherwise the device activation will fail later. Fixes: 8f0009a225 ("dm crypt: optionally support larger encryption sector size") Cc: stable@vger.kernel.org # 4.12+ Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-28dm crypt: fix memory leak in crypt_ctr_cipher_old()Jeffy Chen
Fix memory leak of cipher_api. Fixes: 33d2f09fcb35 (dm crypt: introduce new format of cipher with "capi:" prefix) Cc: stable@vger.kernel.org # 4.12+ Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-14Merge tag 'for-4.14/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Some request-based DM core and DM multipath fixes and cleanups - Constify a few variables in DM core and DM integrity - Add bufio optimization and checksum failure accounting to DM integrity - Fix DM integrity to avoid checking integrity of failed reads - Fix DM integrity to use init_completion - A couple DM log-writes target fixes - Simplify DAX flushing by eliminating the unnecessary flush abstraction that was stood up for DM's use. * tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dax: remove the pmem_dax_ops->flush abstraction dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK dm integrity: make blk_integrity_profile structure const dm integrity: do not check integrity for failed read operations dm log writes: fix >512b sectorsize support dm log writes: don't use all the cpu while waiting to log blocks dm ioctl: constify ioctl lookup table dm: constify argument arrays dm integrity: count and display checksum failures dm integrity: optimize writing dm-bufio buffers that are partially changed dm rq: do not update rq partially in each ending bio dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior dm mpath: complain about unsupported __multipath_map_bio() return values dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-08-28dm: constify argument arraysEric Biggers
The arrays of 'struct dm_arg' are never modified by the device-mapper core, so constify them so that they are placed in .rodata. (Exception: the args array in dm-raid cannot be constified because it is allocated on the stack and modified.) Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09dm-crypt: don't mess with BIP_BLOCK_INTEGRITYChristoph Hellwig
This flag is never set right after calling bio_integrity_alloc, so don't clear it and confuse the reader. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-04crypto: algapi - make crypto_xor() take separate dst and src argumentsArd Biesheuvel
There are quite a number of occurrences in the kernel of the pattern if (dst != src) memcpy(dst, src, walk.total % AES_BLOCK_SIZE); crypto_xor(dst, final, walk.total % AES_BLOCK_SIZE); or crypto_xor(keystream, src, nbytes); memcpy(dst, keystream, nbytes); where crypto_xor() is preceded or followed by a memcpy() invocation that is only there because crypto_xor() uses its output parameter as one of the inputs. To avoid having to add new instances of this pattern in the arm64 code, which will be refactored to implement non-SIMD fallbacks, add an alternative implementation called crypto_xor_cpy(), taking separate input and output arguments. This removes the need for the separate memcpy(). Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-06-19dm crypt: add big-endian variant of plain64 IVMilan Broz
The big-endian IV (plain64be) is needed to map images from extracted disks that are used in some external (on-chip FDE) disk encryption drives, e.g.: data recovery from external USB/SATA drives that support "internal" encryption. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-06-18blk: make the bioset rescue_workqueue optional.NeilBrown
This patch converts bioset_create() to not create a workqueue by default, so alloctions will never trigger punt_bios_to_rescuer(). It also introduces a new flag BIOSET_NEED_RESCUER which tells bioset_create() to preserve the old behavior. All callers of bioset_create() that are inside block device drivers, are given the BIOSET_NEED_RESCUER flag. biosets used by filesystems or other top-level users do not need rescuing as the bio can never be queued behind other bios. This includes fs_bio_set, blkdev_dio_pool, btrfs_bioset, xfs_ioend_bioset, and one allocated by target_core_iblock.c. biosets used by md/raid do not need rescuing as their usage was recently audited and revised to never risk deadlock. It is hoped that most, if not all, of the remaining biosets can end up being the non-rescued version. Reviewed-by: Christoph Hellwig <hch@lst.de> Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes) Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18blk: replace bioset_create_nobvec() with a flags arg to bioset_create()NeilBrown
"flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-09block: switch bios to blk_status_tChristoph Hellwig
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09dm: don't return errnos from ->mapChristoph Hellwig
Instead use the special DM_MAPIO_KILL return value to return -EIO just like we do for the request based path. Note that dm-log-writes returned -ENOMEM in a few places, which now becomes -EIO instead. No consumer treats -ENOMEM special so this shouldn't be an issue (and it should use a mempool to start with to make guaranteed progress). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-03Merge tag 'for-4.12/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - A major update for DM cache that reduces the latency for deciding whether blocks should migrate to/from the cache. The bio-prison-v2 interface supports this improvement by enabling direct dispatch of work to workqueues rather than having to delay the actual work dispatch to the DM cache core. So the dm-cache policies are much more nimble by being able to drive IO as they see fit. One immediate benefit from the improved latency is a cache that should be much more adaptive to changing workloads. - Add a new DM integrity target that emulates a block device that has additional per-sector tags that can be used for storing integrity information. - Add a new authenticated encryption feature to the DM crypt target that builds on the capabilities provided by the DM integrity target. - Add MD interface for switching the raid4/5/6 journal mode and update the DM raid target to use it to enable aid4/5/6 journal write-back support. - Switch the DM verity target over to using the asynchronous hash crypto API (this helps work better with architectures that have access to off-CPU algorithm providers, which should reduce CPU utilization). - Various request-based DM and DM multipath fixes and improvements from Bart and Christoph. - A DM thinp target fix for a bio structure leak that occurs for each discard IFF discard passdown is enabled. - A fix for a possible deadlock in DM bufio and a fix to re-check the new buffer allocation watermark in the face of competing admin changes to the 'max_cache_size_bytes' tunable. - A couple DM core cleanups. * tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (50 commits) dm bufio: check new buffer allocation watermark every 30 seconds dm bufio: avoid a possible ABBA deadlock dm mpath: make it easier to detect unintended I/O request flushes dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit() dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH dm: introduce enum dm_queue_mode to cleanup related code dm mpath: verify __pg_init_all_paths locking assumptions at runtime dm: verify suspend_locking assumptions at runtime dm block manager: remove an unused argument from dm_block_manager_create() dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue() dm mpath: delay requeuing while path initialization is in progress dm mpath: avoid that path removal can trigger an infinite loop dm mpath: split and rename activate_path() to prepare for its expanded use dm ioctl: prevent stack leak in dm ioctl call dm integrity: use previously calculated log2 of sectors_per_block dm integrity: use hex2bin instead of open-coded variant dm crypt: replace custom implementation of hex2bin() dm crypt: remove obsolete references to per-CPU state dm verity: switch to using asynchronous hash crypto API dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues ...
2017-04-27dm crypt: replace custom implementation of hex2bin()Andy Shevchenko
There is no need to have a duplication of the generic library, i.e. hex2bin(). Replace the open coded variant. Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-25dm crypt: remove obsolete references to per-CPU stateEric Biggers
dm-crypt used to use separate crypto transforms for each CPU, but this is no longer the case. To avoid confusion, fix up obsolete comments and rename setup_essiv_cpu(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24dm crypt: use WQ_HIGHPRI for the IO and crypt workqueuesTim Murray
Running dm-crypt with workqueues at the standard priority results in IO competing for CPU time with standard user apps, which can lead to pipeline bubbles and seriously degraded performance. Move to using WQ_HIGHPRI workqueues to protect against that. Signed-off-by: Tim Murray <timmurray@google.com> Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24dm crypt: rewrite (wipe) key in crypto layer using random dataOndrej Kozina
The message "key wipe" used to wipe real key stored in crypto layer by rewriting it with zeroes. Since commit 28856a9 ("crypto: xts - consolidate sanity check for keys") this no longer works in FIPS mode for XTS. While running in FIPS mode the crypto key part has to differ from the tweak key. Fixes: 28856a9 ("crypto: xts - consolidate sanity check for keys") Cc: stable@vger.kernel.org Signed-off-by: Ondrej Kozina <okozina@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24dm crypt: fix large block integrity supportMikulas Patocka
Previously, dm-crypt could use blocks composed of multiple 512b sectors but it created integrity profile for each 512b sector (it padded it with zeroes). Fix dm-crypt so that the integrity profile is sent for each block not each sector. The user must use the same block size in the DM crypt and integrity targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-08block: remove the discard_zeroes_data flagChristoph Hellwig
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-24dm crypt: use shifts instead of sector_divMikulas Patocka
sector_div is very slow, so we introduce a variable sector_shift and use shift instead of sector_div. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24dm crypt: optionally support larger encryption sector sizeMilan Broz
Add optional "sector_size" parameter that specifies encryption sector size (atomic unit of block device encryption). Parameter can be in range 512 - 4096 bytes and must be power of two. For compatibility reasons, the maximal IO must fit into the page limit, so the limit is set to the minimal page size possible (4096 bytes). NOTE: this device cannot yet be handled by cryptsetup if this parameter is set. IV for the sector is calculated from the 512 bytes sector offset unless the iv_large_sectors option is used. Test script using dmsetup: DEV="/dev/sdb" DEV_SIZE=$(blockdev --getsz $DEV) KEY="9c1185a5c5e9fc54612808977ee8f548b2258d31ddadef707ba62c166051b9e3cd0294c27515f2bccee924e8823ca6e124b8fc3167ed478bca702babe4e130ac" BLOCK_SIZE=4096 # dmsetup create test_crypt --table "0 $DEV_SIZE crypt aes-xts-plain64 $KEY 0 $DEV 0 1 sector_size:$BLOCK_SIZE" # dmsetup table --showkeys test_crypt Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24dm crypt: introduce new format of cipher with "capi:" prefixMilan Broz
For the new authenticated encryption we have to support generic composed modes (combination of encryption algorithm and authenticator) because this is how the kernel crypto API accesses such algorithms. To simplify the interface, we accept an algorithm directly in crypto API format. The new format is recognised by the "capi:" prefix. The dmcrypt internal IV specification is the same as for the old format. The crypto API cipher specifications format is: capi:cipher_api_spec-ivmode[:ivopts] Examples: capi:cbc(aes)-essiv:sha256 (equivalent to old aes-cbc-essiv:sha256) capi:xts(aes)-plain64 (equivalent to old aes-xts-plain64) Examples of authenticated modes: capi:gcm(aes)-random capi:authenc(hmac(sha256),xts(aes))-random capi:rfc7539(chacha20,poly1305)-random Authenticated modes can only be configured using the new cipher format. Note that this format allows user to specify arbitrary combinations that can be insecure. (Policy decision is done in cryptsetup userspace.) Authenticated encryption algorithms can be of two types, either native modes (like GCM) that performs both encryption and authentication internally, or composed modes where user can compose AEAD with separate specification of encryption algorithm and authenticator. For composed mode with HMAC (length-preserving encryption mode like an XTS and HMAC as an authenticator) we have to calculate HMAC digest size (the separate authentication key is the same size as the HMAC digest). Introduce crypt_ctr_auth_cipher() to parse the crypto API string to get HMAC algorithm and retrieve digest size from it. Also, for HMAC composed mode we need to parse the crypto API string to get the cipher mode nested in the specification. For native AEAD mode (like GCM), we can use crypto_tfm_alg_name() API to get the cipher specification. Because the HMAC composed mode is not processed the same as the native AEAD mode, the CRYPT_MODE_INTEGRITY_HMAC flag is no longer needed and "hmac" specification for the table integrity argument is removed. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24dm crypt: factor IV constructor out to separate functionMilan Broz
No functional change. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24dm crypt: add cryptographic data integrity protection (authenticated encryption)Milan Broz
Allow the use of per-sector metadata, provided by the dm-integrity module, for integrity protection and persistently stored per-sector Initialization Vector (IV). The underlying device must support the "DM-DIF-EXT-TAG" dm-integrity profile. The per-bio integrity metadata is allocated by dm-crypt for every bio. Example of low-level mapping table for various types of use: DEV=/dev/sdb SIZE=417792 # Additional HMAC with CBC-ESSIV, key is concatenated encryption key + HMAC key SIZE_INT=389952 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 32 J 0" dmsetup create y --table "0 $SIZE_INT crypt aes-cbc-essiv:sha256 \ 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \ 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \ 0 /dev/mapper/x 0 1 integrity:32:hmac(sha256)" # AEAD (Authenticated Encryption with Additional Data) - GCM with random IVs # GCM in kernel uses 96bits IV and we store 128bits auth tag (so 28 bytes metadata space) SIZE_INT=393024 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 28 J 0" dmsetup create y --table "0 $SIZE_INT crypt aes-gcm-random \ 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \ 0 /dev/mapper/x 0 1 integrity:28:aead" # Random IV only for XTS mode (no integrity protection but provides atomic random sector change) SIZE_INT=401272 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 16 J 0" dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \ 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \ 0 /dev/mapper/x 0 1 integrity:16:none" # Random IV with XTS + HMAC integrity protection SIZE_INT=377656 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 48 J 0" dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \ 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \ 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \ 0 /dev/mapper/x 0 1 integrity:48:hmac(sha256)" Both AEAD and HMAC protection authenticates not only data but also sector metadata. HMAC protection is implemented through autenc wrapper (so it is processed the same way as an authenticated mode). In HMAC mode there are two keys (concatenated in dm-crypt mapping table). First is the encryption key and the second is the key for authentication (HMAC). (It is userspace decision if these keys are independent or somehow derived.) The sector request for AEAD/HMAC authenticated encryption looks like this: |----- AAD -------|------ DATA -------|-- AUTH TAG --| | (authenticated) | (auth+encryption) | | | sector_LE | IV | sector in/out | tag in/out | For writes, the integrity fields are calculated during AEAD encryption of every sector and stored in bio integrity fields and sent to underlying dm-integrity target for storage. For reads, the integrity metadata is verified during AEAD decryption of every sector (they are filled in by dm-integrity, but the integrity fields are pre-allocated in dm-crypt). There is also an experimental support in cryptsetup utility for more friendly configuration (part of LUKS2 format). Because the integrity fields are not valid on initial creation, the device must be "formatted". This can be done by direct-io writes to the device (e.g. dd in direct-io mode). For now, there is available trivial tool to do this, see: https://github.com/mbroz/dm_int_tools Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com> Signed-off-by: Vashek Matyas <matyas@fi.muni.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-02KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()David Howells
rcu_dereference_key() and user_key_payload() are currently being used in two different, incompatible ways: (1) As a wrapper to rcu_dereference() - when only the RCU read lock used to protect the key. (2) As a wrapper to rcu_dereference_protected() - when the key semaphor is used to protect the key and the may be being modified. Fix this by splitting both of the key wrappers to produce: (1) RCU accessors for keys when caller has the key semaphore locked: dereference_key_locked() user_key_payload_locked() (2) RCU accessors for keys when caller holds the RCU read lock: dereference_key_rcu() user_key_payload_rcu() This should fix following warning in the NFS idmapper =============================== [ INFO: suspicious RCU usage. ] 4.10.0 #1 Tainted: G W ------------------------------- ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 1 lock held by mount.nfs/5987: #0: (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4] stack backtrace: CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G W 4.10.0 #1 Call Trace: dump_stack+0xe8/0x154 (unreliable) lockdep_rcu_suspicious+0x140/0x190 nfs_idmap_get_key+0x380/0x420 [nfsv4] nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4] decode_getfattr_attrs+0xfac/0x16b0 [nfsv4] decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4] nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4] rpcauth_unwrap_resp+0xe8/0x140 [sunrpc] call_decode+0x29c/0x910 [sunrpc] __rpc_execute+0x140/0x8f0 [sunrpc] rpc_run_task+0x170/0x200 [sunrpc] nfs4_call_sync_sequence+0x68/0xa0 [nfsv4] _nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4] nfs4_lookup_root+0xe0/0x350 [nfsv4] nfs4_lookup_root_sec+0x70/0xa0 [nfsv4] nfs4_find_root_sec+0xc4/0x100 [nfsv4] nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4] nfs4_get_rootfh+0x6c/0x190 [nfsv4] nfs4_server_common_setup+0xc4/0x260 [nfsv4] nfs4_create_server+0x278/0x3c0 [nfsv4] nfs4_remote_mount+0x50/0xb0 [nfsv4] mount_fs+0x74/0x210 vfs_kern_mount+0x78/0x220 nfs_do_root_mount+0xb0/0x140 [nfsv4] nfs4_try_mount+0x60/0x100 [nfsv4] nfs_fs_mount+0x5ec/0xda0 [nfs] mount_fs+0x74/0x210 vfs_kern_mount+0x78/0x220 do_mount+0x254/0xf70 SyS_mount+0x94/0x100 system_call+0x38/0xe0 Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-02-20Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - Implement wraparound-safe refcount_t and kref_t types based on generic atomic primitives (Peter Zijlstra) - Improve and fix the ww_mutex code (Nicolai Hähnle) - Add self-tests to the ww_mutex code (Chris Wilson) - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr Bueso) - Micro-optimize the current-task logic all around the core kernel (Davidlohr Bueso) - Tidy up after recent optimizations: remove stale code and APIs, clean up the code (Waiman Long) - ... plus misc fixes, updates and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) fork: Fix task_struct alignment locking/spinlock/debug: Remove spinlock lockup detection code lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS lkdtm: Convert to refcount_t testing kref: Implement 'struct kref' using refcount_t refcount_t: Introduce a special purpose refcount type sched/wake_q: Clarify queue reinit comment sched/wait, rcuwait: Fix typo in comment locking/mutex: Fix lockdep_assert_held() fail locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock() locking/rwsem: Reinit wake_q after use locking/rwsem: Remove unnecessary atomic_long_t casts jump_labels: Move header guard #endif down where it belongs locking/atomic, kref: Implement kref_put_lock() locking/ww_mutex: Turn off __must_check for now locking/atomic, kref: Avoid more abuse locking/atomic, kref: Use kref_get_unless_zero() more locking/atomic, kref: Kill kref_sub() locking/atomic, kref: Add kref_read() locking/atomic, kref: Add KREF_INIT() ...
2017-02-03dm crypt: replace RCU read-side section with rwsemOndrej Kozina
The lockdep splat below hints at a bug in RCU usage in dm-crypt that was introduced with commit c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service"). The kernel keyring function user_key_payload() is in fact a wrapper for rcu_dereference_protected() which must not be called with only rcu_read_lock() section mark. Unfortunately the kernel keyring subsystem doesn't currently provide an interface that allows the use of an RCU read-side section. So for now we must drop RCU in favour of rwsem until a proper function is made available in the kernel keyring subsystem. =============================== [ INFO: suspicious RCU usage. ] 4.10.0-rc5 #2 Not tainted ------------------------------- ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 2 locks held by cryptsetup/6464: #0: (&md->type_lock){+.+.+.}, at: [<ffffffffa02472a2>] dm_lock_md_type+0x12/0x20 [dm_mod] #1: (rcu_read_lock){......}, at: [<ffffffffa02822f8>] crypt_set_key+0x1d8/0x4b0 [dm_crypt] stack backtrace: CPU: 1 PID: 6464 Comm: cryptsetup Not tainted 4.10.0-rc5 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014 Call Trace: dump_stack+0x67/0x92 lockdep_rcu_suspicious+0xc5/0x100 crypt_set_key+0x351/0x4b0 [dm_crypt] ? crypt_set_key+0x1d8/0x4b0 [dm_crypt] crypt_ctr+0x341/0xa53 [dm_crypt] dm_table_add_target+0x147/0x330 [dm_mod] table_load+0x111/0x350 [dm_mod] ? retrieve_status+0x1c0/0x1c0 [dm_mod] ctl_ioctl+0x1f5/0x510 [dm_mod] dm_ctl_ioctl+0xe/0x20 [dm_mod] do_vfs_ioctl+0x8e/0x690 ? ____fput+0x9/0x10 ? task_work_run+0x7e/0xa0 ? trace_hardirqs_on_caller+0x122/0x1b0 SyS_ioctl+0x3c/0x70 entry_SYSCALL_64_fastpath+0x18/0xad RIP: 0033:0x7f392c9a4ec7 RSP: 002b:00007ffef6383378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffef63830a0 RCX: 00007f392c9a4ec7 RDX: 000000000124fcc0 RSI: 00000000c138fd09 RDI: 0000000000000005 RBP: 00007ffef6383090 R08: 00000000ffffffff R09: 00000000012482b0 R10: 2a28205d34383336 R11: 0000000000000246 R12: 00007f392d803a08 R13: 00007ffef63831e0 R14: 0000000000000000 R15: 00007f392d803a0b Fixes: c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service") Reported-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-01-14sched/core: Remove set_task_state()Davidlohr Bueso
This is a nasty interface and setting the state of a foreign task must not be done. As of the following commit: be628be0956 ("bcache: Make gc wakeup sane, remove set_task_state()") ... everyone in the kernel calls set_task_state() with current, allowing the helper to be removed. However, as the comment indicates, it is still around for those archs where computing current is more expensive than using a pointer, at least in theory. An important arch that is affected is arm64, however this has been addressed now [1] and performance is up to par making no difference with either calls. Of all the callers, if any, it's the locking bits that would care most about this -- ie: we end up passing a tsk pointer to a lot of the lock slowpath, and setting ->state on that. The following numbers are based on two tests: a custom ad-hoc microbenchmark that just measures latencies (for ~65 million calls) between get_task_state() vs get_current_state(). Secondly for a higher overview, an unlink microbenchmark was used, which pounds on a single file with open, close,unlink combos with increasing thread counts (up to 4x ncpus). While the workload is quite unrealistic, it does contend a lot on the inode mutex or now rwsem. [1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com == 1. x86-64 == Avg runtime set_task_state(): 601 msecs Avg runtime set_current_state(): 552 msecs vanilla dirty Hmean unlink1-processes-2 36089.26 ( 0.00%) 38977.33 ( 8.00%) Hmean unlink1-processes-5 28555.01 ( 0.00%) 29832.55 ( 4.28%) Hmean unlink1-processes-8 37323.75 ( 0.00%) 44974.57 ( 20.50%) Hmean unlink1-processes-12 43571.88 ( 0.00%) 44283.01 ( 1.63%) Hmean unlink1-processes-21 34431.52 ( 0.00%) 38284.45 ( 11.19%) Hmean unlink1-processes-30 34813.26 ( 0.00%) 37975.17 ( 9.08%) Hmean unlink1-processes-48 37048.90 ( 0.00%) 39862.78 ( 7.59%) Hmean unlink1-processes-79 35630.01 ( 0.00%) 36855.30 ( 3.44%) Hmean unlink1-processes-110 36115.85 ( 0.00%) 39843.91 ( 10.32%) Hmean unlink1-processes-141 32546.96 ( 0.00%) 35418.52 ( 8.82%) Hmean unlink1-processes-172 34674.79 ( 0.00%) 36899.21 ( 6.42%) Hmean unlink1-processes-203 37303.11 ( 0.00%) 36393.04 ( -2.44%) Hmean unlink1-processes-224 35712.13 ( 0.00%) 36685.96 ( 2.73%) == 2. ppc64le == Avg runtime set_task_state(): 938 msecs Avg runtime set_current_state: 940 msecs vanilla dirty Hmean unlink1-processes-2 19269.19 ( 0.00%) 30704.50 ( 59.35%) Hmean unlink1-processes-5 20106.15 ( 0.00%) 21804.15 ( 8.45%) Hmean unlink1-processes-8 17496.97 ( 0.00%) 17243.28 ( -1.45%) Hmean unlink1-processes-12 14224.15 ( 0.00%) 17240.21 ( 21.20%) Hmean unlink1-processes-21 14155.66 ( 0.00%) 15681.23 ( 10.78%) Hmean unlink1-processes-30 14450.70 ( 0.00%) 15995.83 ( 10.69%) Hmean unlink1-processes-48 16945.57 ( 0.00%) 16370.42 ( -3.39%) Hmean unlink1-processes-79 15788.39 ( 0.00%) 14639.27 ( -7.28%) Hmean unlink1-processes-110 14268.48 ( 0.00%) 14377.40 ( 0.76%) Hmean unlink1-processes-141 14023.65 ( 0.00%) 16271.69 ( 16.03%) Hmean unlink1-processes-172 13417.62 ( 0.00%) 16067.55 ( 19.75%) Hmean unlink1-processes-203 15293.08 ( 0.00%) 15440.40 ( 0.96%) Hmean unlink1-processes-234 13719.32 ( 0.00%) 16190.74 ( 18.01%) Hmean unlink1-processes-265 16400.97 ( 0.00%) 16115.22 ( -1.74%) Hmean unlink1-processes-296 14388.60 ( 0.00%) 16216.13 ( 12.70%) Hmean unlink1-processes-320 15771.85 ( 0.00%) 15905.96 ( 0.85%) x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching) and ppc64 (with paca) show similar improvements in the unlink microbenches. The small delta for ppc64 (2ms), does not represent the gains on the unlink runs. In the case of x86, there was a decent amount of variation in the latency runs, but always within a 20 to 50ms increase), ppc was more constant. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: mark.rutland@arm.com Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-14Merge tag 'dm-4.10-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - various fixes and improvements to request-based DM and DM multipath - some locking improvements in DM bufio - add Kconfig option to disable the DM block manager's extra locking which mainly serves as a developer tool - a few bug fixes to DM's persistent-data - a couple changes to prepare for multipage biovec support in the block layer - various improvements and cleanups in the DM core, DM cache, DM raid and DM crypt - add ability to have DM crypt use keys from the kernel key retention service - add a new "error_writes" feature to the DM flakey target, reads are left unchanged in this mode * tag 'dm-4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (40 commits) dm flakey: introduce "error_writes" feature dm cache policy smq: use hash_32() instead of hash_32_generic() dm crypt: reject key strings containing whitespace chars dm space map: always set ev if sm_ll_mutate() succeeds dm space map metadata: skip useless memcpy in metadata_ll_init_index() dm space map metadata: fix 'struct sm_metadata' leak on failed create Documentation: dm raid: define data_offset status field dm raid: fix discard support regression dm raid: don't allow "write behind" with raid4/5/6 dm mpath: use hw_handler_params if attached hw_handler is same as requested dm crypt: add ability to use keys from the kernel key retention service dm array: remove a dead assignment in populate_ablock_with_values() dm ioctl: use offsetof() instead of open-coding it dm rq: simplify use_blk_mq initialization dm: use blk_set_queue_dying() in __dm_destroy() dm bufio: drop the lock when doing GFP_NOIO allocation dm bufio: don't take the lock in dm_bufio_shrink_count dm bufio: avoid sleeping while holding the dm_bufio lock dm table: simplify dm_table_determine_type() dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM device ...
2016-12-08dm crypt: reject key strings containing whitespace charsOndrej Kozina
Unfortunately key_string may theoretically contain whitespace even after it's processed by dm_split_args(). The reason for this is DM core supports escaping of almost all chars including any whitespace. If userspace passes a key to the kernel in format ":32:logon:my_prefix:my\ key" dm-crypt will look up key "my_prefix:my key" in kernel keyring service. So far everything's fine. Unfortunately if userspace later calls DM_TABLE_STATUS ioctl, it will not receive back expected ":32:logon:my_prefix:my\ key" but the unescaped version instead. Also userpace (most notably cryptsetup) is not ready to parse single target argument containing (even escaped) whitespace chars and any whitespace is simply taken as delimiter of another argument. This effect is mitigated by the fact libdevmapper curently performs double escaping of '\' char. Any user input in format "x\ x" is transformed into "x\\ x" before being passed to the kernel. Nonetheless dm-crypt may be used without libdevmapper. Therefore the near-term solution to this is to reject any key string containing whitespace. Signed-off-by: Ondrej Kozina <okozina@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm crypt: add ability to use keys from the kernel key retention serviceOndrej Kozina
The kernel key service is a generic way to store keys for the use of other subsystems. Currently there is no way to use kernel keys in dm-crypt. This patch aims to fix that. Instead of key userspace may pass a key description with preceding ':'. So message that constructs encryption mapping now looks like this: <cipher> [<key>|:<key_string>] <iv_offset> <dev_path> <start> [<#opt_params> <opt_params>] where <key_string> is in format: <key_size>:<key_type>:<key_description> Currently we only support two elementary key types: 'user' and 'logon'. Keys may be loaded in dm-crypt either via <key_string> or using classical method and pass the key in hex representation directly. dm-crypt device initialised with a key passed in hex representation may be replaced with key passed in key_string format and vice versa. (Based on original work by Andrey Ryabinin) Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm crypt: constify crypt_iv_operations structuresJulia Lawall
The crypt_iv_operations are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm crypt: rename crypt_setkey_allcpus to crypt_setkeyMikulas Patocka
In the past, dm-crypt used per-cpu crypto context. This has been removed in the kernel 3.15 and the crypto context is shared between all cpus. This patch renames the function crypt_setkey_allcpus to crypt_setkey, because there is really no activity that is done for all cpus. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm crypt: mark key as invalid until properly loadedOndrej Kozina
In crypt_set_key(), if a failure occurs while replacing the old key (e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag set. Otherwise, the crypto layer would have an invalid key that still has DM_CRYPT_KEY_VALID flag set. Cc: stable@vger.kernel.org Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm crypt: use bio_add_page()Ming Lei
Use bio_add_page(), the standard interface for adding a page to a bio, rather than open-coding the same. It should be noted that the 'clone' bio that is allocated using bio_alloc_bioset(), in crypt_alloc_buffer(), does _not_ set the bio's BIO_CLONED flag. As such, bio_add_page()'s early return for true bio clones (those with BIO_CLONED set) isn't applicable. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-28block: better op and flags encodingChristoph Hellwig
Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>