aboutsummaryrefslogtreecommitdiffstats
path: root/fs/nfs/write.c
AgeCommit message (Collapse)Author
2017-08-15NFS: Don't unlock writebacks before declaring PG_WB_ENDTrond Myklebust
We don't want nfs_lock_and_join_requests() to start fiddling with the request before the call to nfs_page_group_sync_on_bit(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Don't check request offset and size without holding a lockTrond Myklebust
Request offsets and sizes are not guaranteed to be stable unless you are holding the request locked. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix an ABBA issue in nfs_lock_and_join_requests()Trond Myklebust
All other callers of nfs_page_group_lock() appear to already hold the page lock on the head page, so doing it in the opposite order here is inefficient, although not deadlock prone since we roll back all locks on contention. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()Trond Myklebust
Yes, this is a situation that should never happen (hence the WARN_ON) but we should still ensure that we free up the locks and references to the faulty pages. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Reduce lock contention in nfs_try_to_update_request()Trond Myklebust
Micro-optimisation to move the lockless check into the for(;;) loop. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Reduce lock contention in nfs_page_find_head_request()Trond Myklebust
Add a lockless check for whether or not the page might be carrying an existing writeback before we grab the inode->i_lock. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Simplify page writebackTrond Myklebust
We don't expect the page header lock to ever be held across I/O, so it should always be safe to wait for it, even if we're doing nonblocking writebacks. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13NFS: Fix commit policy for non-blocking calls to nfs_write_inode()Trond Myklebust
Now that the writes will schedule a commit on their own, we don't need nfs_write_inode() to schedule one if there are outstanding writes, and we're being called in non-blocking mode. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13NFS: Ensure we commit after writeback is completeTrond Myklebust
If the page cache is being flushed, then we want to ensure that we do start a commit once the pages are done being flushed. If we just wait until all I/O is done to that file, we can end up livelocking until the balance_dirty_pages() mechanism puts its foot down and forces I/O to stop. So instead we do more or less the same thing that O_DIRECT does, and set up a counter to tell us when the flush is done, Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-05-10Merge tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable bugfixes: - Fix use after free in write error path - Use GFP_NOIO for two allocations in writeback - Fix a hang in OPEN related to server reboot - Check the result of nfs4_pnfs_ds_connect - Fix an rcu lock leak Features: - Removal of the unmaintained and unused OSD pNFS layout - Cleanup and removal of lots of unnecessary dprintk()s - Cleanup and removal of some memory failure paths now that GFP_NOFS is guaranteed to never fail. - Remove the v3-only data server limitation on pNFS/flexfiles Bugfixes: - RPC/RDMA connection handling bugfixes - Copy offload: fixes to ensure the copied data is COMMITed to disk. - Readdir: switch back to using the ->iterate VFS interface - File locking fixes from Ben Coddington - Various use-after-free and deadlock issues in pNFS - Write path bugfixes" * tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (89 commits) pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled NFSv4.1: Work around a Linux server bug... NFS append COMMIT after synchronous COPY NFSv4: Fix exclusive create attributes encoding NFSv4: Fix an rcu lock leak nfs: use kmap/kunmap directly NFS: always treat the invocation of nfs_getattr as cache hit when noac is on Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits pNFS: Fix a typo in pnfs_generic_alloc_ds_commits pNFS: Fix a deadlock when coalescing writes and returning the layout pNFS: Don't clear the layout return info if there are segments to return pNFS: Ensure we commit the layout if it has been invalidated pNFS: Don't send COMMITs to the DSes if the server invalidated our layout pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path pNFS: Ensure we check layout validity before marking it for return NFS4.1 handle interrupted slot reuse from ERR_DELAY NFSv4: check return value of xdr_inline_decode nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout() ...
2017-05-08NFS append COMMIT after synchronous COPYOlga Kornievskaia
Instead of messing with the commit path which has been causing issues, add a COMMIT op after the COPY and ask for stable copies in the first space. It saves a round trip, since after the COPY, the client sends a COMMIT anyway. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-26NFSv4: Don't special case "launder"Trond Myklebust
If the client receives a fatal server error from nfs_pageio_add_request(), then we should always truncate the page on which the error occurred. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25NFS: Don't write back further requests if there is a pending write errorTrond Myklebust
If the server has already returned a fatal write error that the user has not yet received on this file, then don't write back the other pages. Instead, act as if they have been sent, and have returned with the same error. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20nfs: Convert to separately allocated bdiJan Kara
Allocate struct backing_dev_info separately instead of embedding it inside the superblock. This unifies handling of bdi among users. CC: Anna Schumaker <anna.schumaker@netapp.com> CC: linux-nfs@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20NFS: move rw_mode to nfs_pageio_headerBenjamin Coddington
Let's try to have it in a cacheline in nfs4_proc_pgio_rpc_prepare(). Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20NFS: Fix use after free in write error pathFred Isaman
Signed-off-by: Fred Isaman <fred.isaman@gmail.com> Fixes: 0bcbf039f6b2b ("nfs: handle request add failure properly") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20NFS: fix usage of mempools.NeilBrown
When passed GFP flags that allow sleeping (such as GFP_NOIO), mempool_alloc() will never return NULL, it will wait until memory is available. This means that we don't need to handle failure, but that we do need to ensure one thread doesn't call mempool_alloc() twice on the one pool without queuing or freeing the first allocation. If multiple threads did this during times of high memory pressure, the pool could be exhausted and a deadlock could result. pnfs_generic_alloc_ds_commits() attempts to allocate from the nfs_commit_mempool while already holding an allocation from that pool. This is not safe. So change nfs_commitdata_alloc() to take a flag that indicates whether failure is acceptable. In pnfs_generic_alloc_ds_commits(), accept failure and handle it as we currently do. Else where, do not accept failure, and do not handle it. Even when failure is acceptable, we want to succeed if possible. That means both - using an entry from the pool if there is one - waiting for direct reclaim is there isn't. We call mempool_alloc(GFP_NOWAIT) to achieve the first, then kmem_cache_alloc(GFP_NOIO|__GFP_NORETRY) to achieve the second. Each of these can fail, but together they do the best they can without blocking indefinitely. The objects returned by kmem_cache_alloc() will still be freed by mempool_free(). This is safe as mempool_alloc() uses exactly the same function to allocate objects (since the mempool was created with mempool_create_slab_pool()). The object returned by mempool_alloc() and kmem_cache_alloc() are indistinguishable so mempool_free() will handle both identically, either adding to the pool or calling kmem_cache_free(). Also, don't test for failure when allocating from nfs_wdata_mempool. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-03-17NFS: fix the fault nrequests decreasing for nfs_inode COPYKinglong Mee
The nfs_commit_file for NFSv4.2's COPY operation goes through the commit path for normal WRITE, but without increase nrequests, so, the nrequests decreased in nfs_commit_release_pages is fault. After that, the nrequests will be wrong. [ 5670.299881] ------------[ cut here ]------------ [ 5670.300295] WARNING: CPU: 0 PID: 27656 at fs/nfs/inode.c:127 nfs_clear_inode+0x66/0x90 [nfs] [ 5670.300558] Modules linked in: nfsv4(E) nfs(E) fscache(E) tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event ppdev f2fs coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_ens1371 intel_rapl_perf gameport snd_ac97_codec vmw_balloon ac97_bus snd_seq snd_pcm joydev snd_rawmidi snd_timer snd_seq_device snd soundcore nfit parport_pc parport acpi_cpufreq tpm_tis tpm_tis_core tpm i2c_piix4 vmw_vmci shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm e1000 crc32c_intel mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: fscache] [ 5670.302925] CPU: 0 PID: 27656 Comm: umount.nfs4 Tainted: G W E 4.11.0-rc1+ #519 [ 5670.303292] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 5670.304094] Call Trace: [ 5670.304510] dump_stack+0x63/0x86 [ 5670.304917] __warn+0xcb/0xf0 [ 5670.305276] warn_slowpath_null+0x1d/0x20 [ 5670.305661] nfs_clear_inode+0x66/0x90 [nfs] [ 5670.306093] nfs4_evict_inode+0x61/0x70 [nfsv4] [ 5670.306480] evict+0xbb/0x1c0 [ 5670.306888] dispose_list+0x4d/0x70 [ 5670.307233] evict_inodes+0x178/0x1a0 [ 5670.307579] generic_shutdown_super+0x44/0xf0 [ 5670.307985] nfs_kill_super+0x21/0x40 [nfs] [ 5670.308325] deactivate_locked_super+0x43/0x70 [ 5670.308698] deactivate_super+0x5a/0x60 [ 5670.309036] cleanup_mnt+0x3f/0x90 [ 5670.309407] __cleanup_mnt+0x12/0x20 [ 5670.309837] task_work_run+0x80/0xa0 [ 5670.310162] exit_to_usermode_loop+0x89/0x90 [ 5670.310497] syscall_return_slowpath+0xaa/0xb0 [ 5670.310875] entry_SYSCALL_64_fastpath+0xa7/0xa9 [ 5670.311197] RIP: 0033:0x7f1bb3617fe7 [ 5670.311545] RSP: 002b:00007ffecbabb828 EFLAGS: 00000206 ORIG_RAX: 00000000000000a6 [ 5670.311906] RAX: 0000000000000000 RBX: 0000000001dca1f0 RCX: 00007f1bb3617fe7 [ 5670.312239] RDX: 000000000000000c RSI: 0000000000000001 RDI: 0000000001dc83c0 [ 5670.312653] RBP: 0000000001dc83c0 R08: 0000000000000001 R09: 0000000000000000 [ 5670.312998] R10: 0000000000000755 R11: 0000000000000206 R12: 00007ffecbabc66a [ 5670.313335] R13: 0000000001dc83a0 R14: 0000000000000000 R15: 0000000000000000 [ 5670.313758] ---[ end trace bf4bfe7764e4eb40 ]--- Cc: linux-kernel@vger.kernel.org Fixes: 67911c8f18 ("NFS: Add nfs_commit_file()") Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Cc: stable@vger.kernel.org # 4.7+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08nfs: no PG_private waiters remain, remove wakerNicholas Piggin
Since commit 4f52b6bb ("NFS: Don't call COMMIT in ->releasepage()"), no tasks wait on PagePrivate, so the wake introduced in commit 95905446 ("NFS: avoid deadlocks with loop-back mounted NFS filesystems.") can be removed. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30sunrpc & nfs: Add and use dprintk_cont macrosJoe Perches
Allow line continuations to work properly with KERN_CONT. Signed-off-by: Joe Perches <joe@perches.com> [Anna: Add fallback dprintk_cont() for when CONFIG_SUNRPC_DEBUG=n] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-01NFS: discard nfs_lockowner structure.NeilBrown
It now has only one field and is only used in one structure. So replaced it in that structure by the field it contains. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01NFS: remove l_pid field from nfs_lockownerNeilBrown
this field is not used in any important way and probably should have been removed by Commit: 8003d3c4aaa5 ("nfs4: treat lock owners as opaque values") which removed the pid argument from nfs4_get_lock_state. Except in unusual and uninteresting cases, two threads with the same ->tgid will have the same ->files pointer, so keeping them both for comparison brings no benefit. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-10-07mm: remove page_file_indexHuang Ying
After using the offset of the swap entry as the key of the swap cache, the page_index() becomes exactly same as page_file_index(). So the page_file_index() is removed and the callers are changed to use page_index() instead. Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-30Merge tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable bugfixes: - nfs: don't create zero-length requests - several LAYOUTGET bugfixes Features: - several performance related features - more aggressive caching when we can rely on close-to-open cache consistency - remove serialisation of O_DIRECT reads and writes - optimise several code paths to not flush to disk unnecessarily. However allow for the idiosyncracies of pNFS for those layout types that need to issue a LAYOUTCOMMIT before the metadata can be updated on the server. - SUNRPC updates to the client data receive path - pNFS/SCSI support RH/Fedora dm-mpath device nodes - pNFS files/flexfiles can now use unprivileged ports when the generic NFS mount options allow it. Bugfixes: - Don't use RDMA direct data placement together with data integrity or privacy security flavours - Remove the RDMA ALLPHYSICAL memory registration mode as it has potential security holes. - Several layout recall fixes to improve NFSv4.1 protocol compliance. - Fix an Oops in the pNFS files and flexfiles connection setup to the DS - Allow retry of operations that used a returned delegation stateid - Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding - Fix writeback races in nfs4_copy_range() and nfs42_proc_deallocate()" * tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (104 commits) pNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding NFSv4: Clean up lookup of SECINFO_NO_NAME NFSv4.2: Fix warning "variable ‘stateids’ set but not used" NFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’" SUNRPC: Fix a compiler warning in fs/nfs/clnt.c pNFS: Remove redundant smp_mb() from pnfs_init_lseg() pNFS: Cleanup - do layout segment initialisation in one place pNFS: Remove redundant stateid invalidation pNFS: Remove redundant pnfs_mark_layout_returned_if_empty() pNFS: Clear the layout metadata if the server changed the layout stateid pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid() NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id pNFS: Do not set plh_return_seq for non-callback related layoutreturns pNFS: Ensure layoutreturn acts as a completion for layout callbacks pNFS: Fix CB_LAYOUTRECALL stateid verification pNFS: Always update the layout barrier seqid on LAYOUTGET pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set pNFS: Clear the layout return tracking on layout reinitialisation pNFS: LAYOUTRETURN should only update the stateid if the layout is valid nfs: don't create zero-length requests ...
2016-07-28mm: move most file-based accounting to the nodeMel Gorman
There are now a number of accounting oddities such as mapped file pages being accounted for on the node while the total number of file pages are accounted on the zone. This can be coped with to some extent but it's confusing so this patch moves the relevant file-based accounted. Due to throttling logic in the page allocator for reliable OOM detection, it is still necessary to track dirty and writeback pages on a per-zone basis. [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting] Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-24Merge branch 'writeback'Trond Myklebust
2016-07-22nfs: don't create zero-length requestsBenjamin Coddington
NFS doesn't expect requests with wb_bytes set to zero and may make unexpected decisions about how to handle that request at the page IO layer. Skip request creation if we won't have any wb_bytes in the request. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Weston Andros Adamson <dros@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-19sunrpc: move NO_CRKEY_TIMEOUT to the auth->au_flagsScott Mayhew
A generic_cred can be used to look up a unx_cred or a gss_cred, so it's not really safe to use the the generic_cred->acred->ac_flags to store the NO_CRKEY_TIMEOUT flag. A lookup for a unx_cred triggered while the KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated with the auth_cred to be in a state where they're perpetually doing 4K NFS_FILE_SYNC writes. This can be reproduced as follows: 1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys. They do not need to be the same export, nor do they even need to be from the same NFS server. Also, v3 is fine. $ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5 $ sudo mount -o v3,sec=sys server2:/export /mnt/sys 2. As the normal user, before accessing the kerberized mount, kinit with a short lifetime (but not so short that renewing the ticket would leave you within the 4-minute window again by the time the original ticket expires), e.g. $ kinit -l 10m -r 60m 3. Do some I/O to the kerberized mount and verify that the writes are wsize, UNSTABLE: $ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1 4. Wait until you're within 4 minutes of key expiry, then do some more I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets set. Verify that the writes are 4K, FILE_SYNC: $ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1 5. Now do some I/O to the sec=sys mount. This will cause RPC_CRED_NO_CRKEY_TIMEOUT to be set: $ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1 6. Writes for that user will now be permanently 4K, FILE_SYNC for that user, regardless of which mount is being written to, until you reboot the client. Renewing the kerberos ticket (assuming it hasn't already expired) will have no effect. Grabbing a new kerberos ticket at this point will have no effect either. Move the flag to the auth->au_flags field (which is currently unused) and rename it slightly to reflect that it's no longer associated with the auth_cred->ac_flags. Add the rpc_auth to the arg list of rpcauth_cred_key_to_expire and check the au_flags there too. Finally, add the inode to the arg list of nfs_ctx_key_to_expire so we can determine the rpc_auth to pass to rpcauth_cred_key_to_expire. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-05NFSv4.2: Fix writeback races in nfs4_copy_file_rangeTrond Myklebust
We need to ensure that any writes to the destination file are serialised with the copy, meaning that the writeback has to occur under the inode lock. Also relax the writeback requirement on the source, and rely on the stateid checking to tell us if the source rebooted. Add the helper nfs_filemap_write_and_wait_range() to call pnfs_sync_inode() as is appropriate for pNFS servers that may need a layoutcommit. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-05NFS: Fix O_DIRECT verifier problemsTrond Myklebust
We should not be interested in looking at the value of the stable field, since that could take any value. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-22NFS: writepage of a single page should not be synchronousTrond Myklebust
It is almost always better to wait for more so that we can issue a bulk commit. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-22NFS: Kill NFS_INO_NFS_INO_FLUSHING: it is a performance killerTrond Myklebust
filemap_datawrite() and friends already deal just fine with livelock. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-05-25nfs: avoid race that crashes nfs_init_commitWeston Andros Adamson
Since the patch "NFS: Allow multiple commit requests in flight per file" we can run multiple simultaneous commits on the same inode. This introduced a race over collecting pages to commit that made it possible to call nfs_init_commit() with an empty list - which causes crashes like the one below. The fix is to catch this race and avoid calling nfs_init_commit and initiate_commit when there is no work to do. Here is the crash: [600522.076832] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 [600522.078475] IP: [<ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs] [600522.078745] PGD 4272b1067 PUD 4272cb067 PMD 0 [600522.078972] Oops: 0000 [#1] SMP [600522.079204] Modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dcdbas ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw vmw_vsock_vmci_transport vsock bonding ipmi_devintf ipmi_msghandler coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev vmw_balloon parport_pc parport acpi_cpufreq vmw_vmci i2c_piix4 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel serio_raw vmxnet3 [600522.081380] vmw_pvscsi ata_generic pata_acpi [600522.081809] CPU: 3 PID: 15667 Comm: /usr/bin/python Not tainted 4.1.9-100.pd.88.el7.x86_64 #1 [600522.082281] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014 [600522.082814] task: ffff8800bbbfa780 ti: ffff88042ae84000 task.ti: ffff88042ae84000 [600522.083378] RIP: 0010:[<ffffffffa0479e72>] [<ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs] [600522.083973] RSP: 0018:ffff88042ae87438 EFLAGS: 00010246 [600522.084571] RAX: 0000000000000000 RBX: ffff880003485e40 RCX: ffff88042ae87588 [600522.085188] RDX: 0000000000000000 RSI: ffff88042ae874b0 RDI: ffff880003485e40 [600522.085756] RBP: ffff88042ae87448 R08: ffff880003486010 R09: ffff88042ae874b0 [600522.086332] R10: 0000000000000000 R11: 0000000000000005 R12: ffff88042ae872d0 [600522.086905] R13: ffff88042ae874b0 R14: ffff880003485e40 R15: ffff88042704c840 [600522.087484] FS: 00007f4728ff2740(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000 [600522.088070] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [600522.088663] CR2: 0000000000000040 CR3: 000000042b6aa000 CR4: 00000000001406e0 [600522.089327] Stack: [600522.089926] 0000000000000001 ffff88042ae87588 ffff88042ae874f8 ffffffffa04f09fa [600522.090549] 0000000000017840 0000000000017840 ffff88042ae87588 ffff8803258d9930 [600522.091169] ffff88042ae87578 ffffffffa0563d80 0000000000000000 ffff88042704c840 [600522.091789] Call Trace: [600522.092420] [<ffffffffa04f09fa>] pnfs_generic_commit_pagelist+0x1da/0x320 [nfsv4] [600522.093052] [<ffffffffa0563d80>] ? ff_layout_commit_prepare_v3+0x30/0x30 [nfs_layout_flexfiles] [600522.093696] [<ffffffffa0562645>] ff_layout_commit_pagelist+0x15/0x20 [nfs_layout_flexfiles] [600522.094359] [<ffffffffa047bc78>] nfs_generic_commit_list+0xe8/0x120 [nfs] [600522.095032] [<ffffffffa047bd6a>] nfs_commit_inode+0xba/0x110 [nfs] [600522.095719] [<ffffffffa046ac54>] nfs_release_page+0x44/0xd0 [nfs] [600522.096410] [<ffffffff811a8122>] try_to_release_page+0x32/0x50 [600522.097109] [<ffffffff811bd4f1>] shrink_page_list+0x961/0xb30 [600522.097812] [<ffffffff811bdced>] shrink_inactive_list+0x1cd/0x550 [600522.098530] [<ffffffff811bea65>] shrink_lruvec+0x635/0x840 [600522.099250] [<ffffffff811bed60>] shrink_zone+0xf0/0x2f0 [600522.099974] [<ffffffff811bf312>] do_try_to_free_pages+0x192/0x470 [600522.100709] [<ffffffff811bf6ca>] try_to_free_pages+0xda/0x170 [600522.101464] [<ffffffff811b2198>] __alloc_pages_nodemask+0x588/0x970 [600522.102235] [<ffffffff811fbbd5>] alloc_pages_vma+0xb5/0x230 [600522.103000] [<ffffffff813a1589>] ? cpumask_any_but+0x39/0x50 [600522.103774] [<ffffffff811d6115>] wp_page_copy.isra.55+0x95/0x490 [600522.104558] [<ffffffff810e3438>] ? __wake_up+0x48/0x60 [600522.105357] [<ffffffff811d7d3b>] do_wp_page+0xab/0x4f0 [600522.106137] [<ffffffff810a1bbb>] ? release_task+0x36b/0x470 [600522.106902] [<ffffffff8126dbd7>] ? eventfd_ctx_read+0x67/0x1c0 [600522.107659] [<ffffffff811da2a8>] handle_mm_fault+0xc78/0x1900 [600522.108431] [<ffffffff81067ef1>] __do_page_fault+0x181/0x420 [600522.109173] [<ffffffff811446a6>] ? __audit_syscall_exit+0x1e6/0x280 [600522.109893] [<ffffffff810681c0>] do_page_fault+0x30/0x80 [600522.110594] [<ffffffff81024f36>] ? syscall_trace_leave+0xc6/0x120 [600522.111288] [<ffffffff81790a58>] page_fault+0x28/0x30 [600522.111947] Code: 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 4c 8d 87 d0 01 00 00 48 89 e5 53 48 89 fb 48 83 ec 08 4c 8b 0e 49 8b 41 18 4c 39 ce <48> 8b 40 40 4c 8b 50 30 74 24 48 8b 87 d0 01 00 00 48 8b 7e 08 [600522.113343] RIP [<ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs] [600522.114003] RSP <ffff88042ae87438> [600522.114636] CR2: 0000000000000040 Fixes: af7cf057 (NFS: Allow multiple commit requests in flight per file) CC: stable@vger.kernel.org Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-25NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()Dan Carpenter
nfs_create_request() doesn't return NULL, it returns error pointers. Fixes: 67911c8f18b5 ('NFS: Add nfs_commit_file()') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17NFS: Reclaim writes via writepage are opportunisticTrond Myklebust
No need to make them a priority any more, or to make them succeed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17NFS: Add nfs_commit_file()Anna Schumaker
Copy will use this to set up a commit request for a generic range. I don't want to allocate a new pagecache entry for the file, so I needed to change parts of the commit path to handle requests with a null wb_page. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09NFS: Save struct inode * inside nfs_commit_info to clarify usage of i_lockDave Wysochanski
Commit ea2cf22 created nfs_commit_info and saved &inode->i_lock inside this NFS specific structure. This obscures the usage of i_lock. Instead, save struct inode * so later it's clear the spinlock taken is i_lock. Should be no functional change. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21NFS: Simplify nfs_request_add_commit_list() argumentsAnna Schumaker
I noticed that all the callers of this function pass cinfo->mds->list as an argument in addition to the cinfo structure itself. Let's get rid of the extra argument, since it doesn't seem to be adding anything. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-07Merge branch 'bugfixes'Trond Myklebust
* bugfixes: SUNRPC: Fixup socket wait for memory SUNRPC: Fix a missing break in rpc_anyaddr() pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh() NFS: Fix attribute cache revalidation NFS: Ensure we revalidate attributes before using execute_ok() NFS: Flush reclaim writes using FLUSH_COND_STABLE NFS: Background flush should not be low priority NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn NFSv4: Don't perform cached access checks before we've OPENed the file NFS: Allow the combination pNFS and labeled NFS NFS42: handle layoutstats stateid error nfs: Fix race in __update_open_stateid() nfs: fix missing assignment in nfs4_sequence_done tracepoint
2016-01-07NFS: Use wait_on_atomic_t() for unlock after readaheadBenjamin Coddington
The use of wait_on_atomic_t() for waiting on I/O to complete before unlocking allows us to git rid of the NFS_IO_INPROGRESS flag, and thus the nfs_iocounter's flags member, and finally the nfs_iocounter altogether. The count of I/O is moved to the lock context, and the counter increment/decrement functions become simple enough to open-code. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> [Trond: Fix up conflict with existing function nfs_wait_atomic_killable()] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-04Merge branch 'pnfs_generic'Trond Myklebust
* pnfs_generic: NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid() NFSv4.1/pNFS: Fix a race in initiate_file_draining() NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn() NFS: Relax requirements in nfs_flush_incompatible NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid NFS: Allow multiple commit requests in flight per file NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs NFSv4: List stateid information in the callback tracepoints NFSv4.1/pNFS: Don't return NFS4ERR_DELAY unnecessarily in CB_LAYOUTRECALL NFSv4.1/pNFS: Ensure we enforce RFC5661 Section 12.5.5.2.1 pNFS: If we have to delay the layout callback, mark the layout for return NFSv4.1/pNFS: Add a helper to mark the layout as returned pNFS: Ensure nfs4_layoutget_prepare returns the correct error
2015-12-31NFS: Relax requirements in nfs_flush_incompatibleTrond Myklebust
If two processes share the same credentials and NFSv4 open stateid, then allow them both to dirty the same page, even if their nfs_open_context differs. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-31NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalidTrond Myklebust
If the layout segment is invalid, then we should not be adding more write requests to the commit list. Instead, those writes should be replayed after requesting a new layout. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-31NFS: Allow multiple commit requests in flight per fileTrond Myklebust
Allow synchronous RPC calls to wait for pending RPC calls to finish, but also allow asynchronous ones to just fire off another commit. With this patch, the xfstests generic/074 test completes in 226s instead of 242s Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-31NFS/pNFS: Fix up pNFS write reschedule layering violations and bugsTrond Myklebust
The flexfiles layout in particular, seems to want to poke around in the O_DIRECT flags when retransmitting. This patch sets up an interface to allow it to call back into O_DIRECT to handle retransmission correctly. It also fixes a potential bug whereby we could change the behaviour of O_DIRECT if an error is already pending. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-28nfs: only remove page from mapping if launder_page failsPeng Tao
Instead of dropping pages when write fails, only do it when we get fatal failure in launder_page write back. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-28nfs: handle request add failure properlyPeng Tao
When we fail to queue a read page to IO descriptor, we need to clean it up otherwise it is hanging around preventing nfs module from being removed. When we fail to queue a write page to IO descriptor, we need to clean it up and also save the failure status to open context. Then at file close, we can try to write pages back again and drop the page if it fails to writeback in .launder_page, which will be done in the next patch. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-28NFS: Flush reclaim writes using FLUSH_COND_STABLETrond Myklebust
If there are already writes queued up for commit, then don't flush just this page even if it is a reclaim issue. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>