aboutsummaryrefslogtreecommitdiffstats
path: root/fs/nfs
AgeCommit message (Collapse)Author
2016-05-09NFS: Fix an LOCK/OPEN race when unlinking an open fileChuck Lever
At Connectathon 2016, we found that recent upstream Linux clients would occasionally send a LOCK operation with a zero stateid. This appeared to happen in close proximity to another thread returning a delegation before unlinking the same file while it remained open. Earlier, the client received a write delegation on this file and returned the open stateid. Now, as it is getting ready to unlink the file, it returns the write delegation. But there is still an open file descriptor on that file, so the client must OPEN the file again before it returns the delegation. Since commit 24311f884189 ('NFSv4: Recovery of recalled read delegations is broken'), nfs_open_delegation_recall() clears the NFS_DELEGATED_STATE flag _before_ it sends the OPEN. This allows a racing LOCK on the same inode to be put on the wire before the OPEN operation has returned a valid open stateid. To eliminate this race, serialize delegation return with the acquisition of a file lock on the same file. Adopt the same approach as is used in the unlock path. This patch also eliminates a similar race seen when sending a LOCK operation at the same time as returning a delegation on the same file. Fixes: 24311f884189 ('NFSv4: Recovery of recalled read ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Anna: Add sentence about LOCK / delegation race] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: have flexfiles mirror keep creds for both ro and rw layoutsJeff Layton
A mirror can be shared between multiple layouts, even with different iomodes. That makes stats gathering simpler, but it causes a problem when we get different creds in READ vs. RW layouts. The current code drops the newer credentials onto the floor when this occurs. That's problematic when you fetch a READ layout first, and then a RW. If the READ layout doesn't have the correct creds to do a write, then writes will fail. We could just overwrite the READ credentials with the RW ones, but that would break the ability for the server to fence the layout for reads if things go awry. We need to be able to revert to the earlier READ creds if the RW layout is returned afterward. The simplest fix is to just keep two sets of creds per mirror. One for READ layouts and one for RW, and then use the appropriate set depending on the iomode of the layout segment. Also fix up some RCU nits that sparse found. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: get a reference to the credential in ff_layout_alloc_lsegJeff Layton
We're just as likely to have allocation problems here as we would if we delay looking up the credential like we currently do. Fix the code to get a rpc_cred reference early, as soon as the mirror is set up. This allows us to eliminate the mirror early if there is a problem getting an rpc credential. This also allows us to drop the uid/gid from the layout_mirror struct as well. In the event that we find an existing mirror where this one would go, we swap in the new creds unconditionally, and drop the reference to the old one. Note that the old ff_layout_update_mirror_cred function wouldn't set this pointer unless the DS version was 3, but we don't know what the DS version is at this point. I'm a little unclear on why it did that as you still need creds to talk to v4 servers as well. I have the code set it regardless of the DS version here. Also note the change to using generic creds instead of calling lookup_cred directly. With that change, we also need to populate the group_info pointer in the acred as some functions expect that to never be NULL. Instead of allocating one every time however, we can allocate one when the module is loaded and share it since the group_info is refcounted. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: have ff_layout_get_ds_cred take a reference to the credJeff Layton
In later patches, we're going to want to allow the creds to be updated when we get a new layout with updated creds. Have this function take a reference to the cred that is later put once the call has been dispatched. Also, prepare for this change by ensuring we follow RCU rules when getting a reference to the cred as well. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: don't call nfs4_ff_layout_prepare_ds from ff_layout_get_ds_credJeff Layton
All the callers already call that function before calling into here, so it ends up being a no-op anyway. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09NFS: Save struct inode * inside nfs_commit_info to clarify usage of i_lockDave Wysochanski
Commit ea2cf22 created nfs_commit_info and saved &inode->i_lock inside this NFS specific structure. This obscures the usage of i_lock. Instead, save struct inode * so later it's clear the spinlock taken is i_lock. Should be no functional change. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: add debug to directio "good_bytes" countingWeston Andros Adamson
This will pop a warning if we count too many good bytes. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09pnfs: set NFS_IOHDR_REDO in pnfs_read_resend_pnfsWeston Andros Adamson
Like other resend paths, mark the (old) hdr as NFS_IOHDR_REDO. This ensures the hdr completion function will not count the (old) hdr as good bytes. Also, vector the error back through the hdr->task.tk_status like other retry calls. This fixes a bug with the FlexFiles layout where libaio was reporting more bytes read than requested. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02nfs: switch to ->iterate_shared()Al Viro
aside of the usual care about seeding dcache from readdir, we need to be careful about the pagecache evictions here. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02nfs: missing wakeup in nfs_unblock_sillyrename()Al Viro
will be needed as soon as lookups are not serialized by ->i_mutex Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02Merge getxattr prototype change into work.lookupsAl Viro
The rest of work.xattr stuff isn't needed for this branch
2016-05-01fs: simplify the generic_write_sync prototypeChristoph Hellwig
The kiocb already has the new position, so use that. The only interesting case is AIO, where we currently don't bother updating ki_pos. We're about to free the kiocb after we're done, so we might as well update it to make everyone's life simpler. While we're at it also return the bytes written argument passed in if we were successful so that the boilerplate error switch code in the callers can go away. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01fs: add IOCB_SYNC and IOCB_DSYNCChristoph Hellwig
This will allow us to do per-I/O sync file writes, as required by a lot of fileservers or storage targets. XXX: Will need a few additional audits for O_DSYNC Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01direct-io: eliminate the offset argument to ->direct_IOChristoph Hellwig
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually work, so eliminate the superflous argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-11KEYS: Add a facility to restrict new links into a keyringDavid Howells
Add a facility whereby proposed new links to be added to a keyring can be vetted, permitting them to be rejected if necessary. This can be used to block public keys from which the signature cannot be verified or for which the signature verification fails. It could also be used to provide blacklisting. This affects operations like add_key(), KEYCTL_LINK and KEYCTL_INSTANTIATE. To this end: (1) A function pointer is added to the key struct that, if set, points to the vetting function. This is called as: int (*restrict_link)(struct key *keyring, const struct key_type *key_type, unsigned long key_flags, const union key_payload *key_payload), where 'keyring' will be the keyring being added to, key_type and key_payload will describe the key being added and key_flags[*] can be AND'ed with KEY_FLAG_TRUSTED. [*] This parameter will be removed in a later patch when KEY_FLAG_TRUSTED is removed. The function should return 0 to allow the link to take place or an error (typically -ENOKEY, -ENOPKG or -EKEYREJECTED) to reject the link. The pointer should not be set directly, but rather should be set through keyring_alloc(). Note that if called during add_key(), preparse is called before this method, but a key isn't actually allocated until after this function is called. (2) KEY_ALLOC_BYPASS_RESTRICTION is added. This can be passed to key_create_or_update() or key_instantiate_and_link() to bypass the restriction check. (3) KEY_FLAG_TRUSTED_ONLY is removed. The entire contents of a keyring with this restriction emplaced can be considered 'trustworthy' by virtue of being in the keyring when that keyring is consulted. (4) key_alloc() and keyring_alloc() take an extra argument that will be used to set restrict_link in the new key. This ensures that the pointer is set before the key is published, thus preventing a window of unrestrictedness. Normally this argument will be NULL. (5) As a temporary affair, keyring_restrict_trusted_only() is added. It should be passed to keyring_alloc() as the extra argument instead of setting KEY_FLAG_TRUSTED_ONLY on a keyring. This will be replaced in a later patch with functions that look in the appropriate places for authoritative keys. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2016-04-10xattr_handler: pass dentry and inode as separate arguments of ->get()Al Viro
... and do not assume they are already attached to each other Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10don't bother with ->d_inode->i_sb - it's always equal to ->d_sbAl Viro
... and neither can ever be NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-07Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "These changes contains a fix for overlayfs interacting with some (badly behaved) dentry code in various file systems. These have been reviewed by Al and the respective file system mtinainers and are going through the ext4 tree for convenience. This also has a few ext4 encryption bug fixes that were discovered in Android testing (yes, we will need to get these sync'ed up with the fs/crypto code; I'll take care of that). It also has some bug fixes and a change to ignore the legacy quota options to allow for xfstests regression testing of ext4's internal quota feature and to be more consistent with how xfs handles this case" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: ignore quota mount options if the quota feature is enabled ext4 crypto: fix some error handling ext4: avoid calling dquot_get_next_id() if quota is not enabled ext4: retry block allocation for failed DIO and DAX writes ext4: add lockdep annotations for i_data_sem ext4: allow readdir()'s of large empty directories to be interrupted btrfs: fix crash/invalid memory access on fsync when using overlayfs ext4 crypto: use dget_parent() in ext4_d_revalidate() ext4: use file_dentry() ext4: use dget_parent() in ext4_file_open() nfs: use file_dentry() fs: add file_dentry() ext4 crypto: don't let data integrity writebacks fail with ENOMEM ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-31posix_acl: Inode acl caching fixesAndreas Gruenbacher
When get_acl() is called for an inode whose ACL is not cached yet, the get_acl inode operation is called to fetch the ACL from the filesystem. The inode operation is responsible for updating the cached acl with set_cached_acl(). This is done without locking at the VFS level, so another task can call set_cached_acl() or forget_cached_acl() before the get_acl inode operation gets to calling set_cached_acl(), and then get_acl's call to set_cached_acl() results in caching an outdate ACL. Prevent this from happening by setting the cached ACL pointer to a task-specific sentinel value before calling the get_acl inode operation. Move the responsibility for updating the cached ACL from the get_acl inode operations to get_acl(). There, only set the cached ACL if the sentinel value hasn't changed. The sentinel values are chosen to have odd values. Likewise, the value of ACL_NOT_CACHED is odd. In contrast, ACL object pointers always have an even value (ACLs are aligned in memory). This allows to distinguish uncached ACLs values from ACL objects. In addition, switch from guarding inode->i_acl and inode->i_default_acl upates by the inode->i_lock spinlock to using xchg() and cmpxchg(). Filesystems that do not want ACLs returned from their get_acl inode operations to be cached must call forget_cached_acl() to prevent the VFS from doing so. (Patch written by Al Viro and Andreas Gruenbacher.) Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-26nfs: use file_dentry()Miklos Szeredi
NFS may be used as lower layer of overlayfs and accessing f_path.dentry can lead to a crash. Fix by replacing direct access of file->f_path.dentry with the file_dentry() accessor, which will always return a native object. Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: <stable@vger.kernel.org> # v4.2 Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-03-24Merge tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull more nfsd updates from Bruce Fields: "Apologies for the previous request, which omitted the top 8 commits from my for-next branch (including the SCSI layout commits). Thanks to Trond for spotting my error!" This actually includes the new layout types, so here's that part of the pull message repeated: "Support for a new pnfs layout type from Christoph Hellwig. The new layout type is a variant of the block layout which uses SCSI features to offer improved fencing and device identification. Note this pull request also includes the client side of SCSI layout, with Trond's permission" * tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux: nfsd: use short read as well as i_size to set eof nfsd: better layoutupdate bounds-checking nfsd: block and scsi layout drivers need to depend on CONFIG_BLOCK nfsd: add SCSI layout support nfsd: move some blocklayout code nfsd: add a new config option for the block layout driver nfs/blocklayout: add SCSI layout support nfs4.h: add SCSI layout definitions
2016-03-22Merge tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - Add support for multiple NFSv4.1 callbacks in flight - Initial patchset for RPC multipath support - Adapt RPC/RDMA to use the new completion queue API Bugfixes and cleanups: - nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed - Cleanups to remove nfs_inode_dio_wait and nfs4_file_fsync - Fix RPC/RDMA credit accounting - Properly handle RDMA_ERROR replies - xprtrdma: Do not wait if ib_post_send() fails - xprtrdma: Segment head and tail XDR buffers on page boundaries - xprtrdma cleanups for dprintk, physical_op_map and unused macros" * tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (35 commits) nfs/blocklayout: make sure making a aligned read request nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed nfs: remove nfs_inode_dio_wait nfs: remove nfs4_file_fsync xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs xprtrdma: Use an anonymous union in struct rpcrdma_mw xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs xprtrdma: Serialize credit accounting again xprtrdma: Properly handle RDMA_ERROR replies rpcrdma: Add RPCRDMA_HDRLEN_ERR xprtrdma: Do not wait if ib_post_send() fails xprtrdma: Segment head and tail XDR buffers on page boundaries xprtrdma: Clean up dprintk format string containing a newline xprtrdma: Clean up physical_op_map() xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3 SUNRPC: Allow addition of new transports to a struct rpc_clnt NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections SUNRPC: Make NFS swap work with multipath ...
2016-03-21nfs/blocklayout: make sure making a aligned read requestKinglong Mee
Only treat write goes up to the inode size as aligned request, because it always write PAGE_CACHE_SIZE, but read a dynamic size. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-03-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: - Preparations of parallel lookups (the remaining main obstacle is the need to move security_d_instantiate(); once that becomes safe, the rest will be a matter of rather short series local to fs/*.c - preadv2/pwritev2 series from Christoph - assorted fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits) splice: handle zero nr_pages in splice_to_pipe() vfs: show_vfsstat: do not ignore errors from show_devname method dcache.c: new helper: __d_add() don't bother with __d_instantiate(dentry, NULL) untangle fsnotify_d_instantiate() a bit uninline d_add() replace d_add_unique() with saner primitive quota: use lookup_one_len_unlocked() cifs_get_root(): use lookup_one_len_unlocked() nfs_lookup: don't bother with d_instantiate(dentry, NULL) kill dentry_unhash() ceph_fill_trace(): don't bother with d_instantiate(dn, NULL) autofs4: don't bother with d_instantiate(dentry, NULL) in ->lookup() configfs: move d_rehash() into configfs_create() for regular files ceph: don't bother with d_rehash() in splice_dentry() namei: teach lookup_slow() to skip revalidate namei: massage lookup_slow() to be usable by lookup_one_len_unlocked() lookup_one_len_unlocked(): use lookup_dcache() namei: simplify invalidation logics in lookup_dcache() namei: change calling conventions for lookup_{fast,slow} and follow_managed() ...
2016-03-18nfs/blocklayout: add SCSI layout supportChristoph Hellwig
This is a trivial extension to the block layout driver to support the new SCSI layouts draft. There are three changes: - device identifcation through the SCSI VPD page. This allows us to directly use the udev generated persistent device names instead of requiring an expensive lookup by crawling every block device node in /dev and reading a signature for it. - use of SCSI persistent reservations to protect device access and allow for robust fencing. On the client sides this just means registering and unregistering a server supplied key. - an optimized LAYOUTCOMMIT payload that doesn't send unessecary fields to the server. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-16nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failedJeff Layton
I hit the following oops out of the blue while testing with flexfiles: BUG: unable to handle kernel NULL pointer dereference at 00000000000000e8 IP: [<ffffffffa048f6b8>] nfs4_ff_find_or_create_ds_client+0x48/0x50 [nfs_layout_flexfiles] PGD 44031067 PUD 5062d067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: nfsv3 nfs_layout_flexfiles tun rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dcdbas nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw bonding ipmi_devintf ipmi_msghandler snd_hda_codec_generic virtio_balloon ppdev snd_hda_intel snd_hda_controller snd_hda_codec iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core parport_pc snd_hwdep parport snd_seq snd_seq_device snd_pcm snd_timer acpi_cpufreq snd soundcore i2c_piix4 xfs libcrc32c joydev virtio_net virtio_console qxl drm_kms_helper ttm crc32c_intel drm virtio_pci serio_raw ata_generic virtio_ring virtio pata_acpi CPU: 0 PID: 19138 Comm: test5 Not tainted 4.1.9-100.pd.90.el7.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 task: ffff88007b70cf00 ti: ffff88004cc44000 task.ti: ffff88004cc44000 RIP: 0010:[<ffffffffa048f6b8>] [<ffffffffa048f6b8>] nfs4_ff_find_or_create_ds_client+0x48/0x50 [nfs_layout_flexfiles] RSP: 0018:ffff88004cc47890 EFLAGS: 00010246 RAX: 0000000000000003 RBX: ffff880050932300 RCX: ffff88006978f488 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003e0e8540 RBP: ffff88004cc47908 R08: 0000000000000000 R09: 0000000000000000 R10: ffff88007ff8c758 R11: 0000000000000005 R12: ffff88003e0e8540 R13: 0000000000000000 R14: ffff88006978f488 R15: ffff88004431cc80 FS: 00007fea40c7c740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000e8 CR3: 0000000044318000 CR4: 00000000000406f0 Stack: ffffffffa048c934 ffff880050932310 0000000100000001 ffff88006978f510 ffff88006978f3c8 ffff88003e56cd90 ffff88004cc479d0 00000020a052aff0 000000000004b000 ffff88004cc47908 ffff880050932300 ffff88004cc479d0 Call Trace: [<ffffffffa048c934>] ? ff_layout_write_pagelist+0x64/0x220 [nfs_layout_flexfiles] [<ffffffffa057a3bf>] pnfs_generic_pg_writepages+0xaf/0x1b0 [nfsv4] [<ffffffffa051ab57>] nfs_pageio_doio+0x27/0x60 [nfs] [<ffffffffa051bfe4>] nfs_pageio_complete_mirror+0x54/0xa0 [nfs] [<ffffffffa051c7ad>] nfs_pageio_complete+0x2d/0x90 [nfs] [<ffffffffa052032d>] nfs_writepage_locked+0x8d/0xe0 [nfs] [<ffffffff811e4630>] ? page_referenced_one+0x1a0/0x1a0 [<ffffffffa05210e7>] nfs_wb_single_page+0xf7/0x190 [nfs] [<ffffffffa05108d1>] nfs_launder_page+0x41/0x90 [nfs] [<ffffffff811b8930>] invalidate_inode_pages2_range+0x340/0x3a0 [<ffffffff811b89a7>] invalidate_inode_pages2+0x17/0x20 [<ffffffffa0513e1e>] nfs_release+0x9e/0xb0 [nfs] [<ffffffffa050fa1d>] nfs_file_release+0x3d/0x60 [nfs] [<ffffffff8122481c>] __fput+0xdc/0x1e0 [<ffffffff8122496e>] ____fput+0xe/0x10 [<ffffffff810bde67>] task_work_run+0xa7/0xe0 [<ffffffff810af735>] get_signal+0x565/0x600 [<ffffffff811a9815>] ? __filemap_fdatawrite_range+0x65/0x90 [<ffffffff810144a7>] do_signal+0x37/0x730 [<ffffffffa0569921>] ? nfs4_file_fsync+0x81/0x150 [nfsv4] [<ffffffff81254dbb>] ? vfs_fsync_range+0x3b/0xb0 [<ffffffff811446a6>] ? __audit_syscall_exit+0x1e6/0x280 [<ffffffff81014bff>] do_notify_resume+0x5f/0xa0 [<ffffffff8178ec3c>] int_signal+0x12/0x17 Code: 48 8b 40 70 8b 00 83 f8 03 74 20 83 f8 04 75 13 55 48 89 ce 48 89 d7 48 89 e5 e8 14 0f 0e 00 5d c3 66 90 0f 0b 66 0f 1f 44 00 00 <48> 8b 82 e8 00 00 00 c3 66 66 66 66 90 55 48 89 e5 41 57 41 56 RIP [<ffffffffa048f6b8>] nfs4_ff_find_or_create_ds_client+0x48/0x50 [nfs_layout_flexfiles] RSP <ffff88004cc47890> CR2: 00000000000000e8 When the DS connection attempt fails, nfs4_ff_layout_prepare_ds marks it for the error but then just returns the ds as if it were usable. The comments though say: /* Upon return, either ds is connected, or ds is NULL */ Ensure that we set the return pointer to NULL in the event that the connection attempt fails. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-03-16nfs: remove nfs_inode_dio_waitChristoph Hellwig
Just call inode_dio_wait directly instead of through a pointless wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-03-16nfs: remove nfs4_file_fsyncChristoph Hellwig
The only difference to nfs_file_fsync is the call to pnfs_sync_inode. But pnfs_sync_inode is just an inline that calls a pNFS layout driver method if CONFIG_PNFS is designed, and thus can be called just fine from the core NFS module. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-03-14replace d_add_unique() with saner primitiveAl Viro
new primitive: d_exact_alias(dentry, inode). If there is an unhashed dentry with the same name/parent and given inode, rehash, grab and return it. Otherwise, return NULL. The only caller of d_add_unique() switched to d_exact_alias() + d_splice_alias(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-14nfs_lookup: don't bother with d_instantiate(dentry, NULL)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-22Merge branch 'multipath'Trond Myklebust
* multipath: NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3 SUNRPC: Allow addition of new transports to a struct rpc_clnt NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections SUNRPC: Make NFS swap work with multipath SUNRPC: Add a helper to apply a function to all the rpc_clnt's transports SUNRPC: Allow caller to specify the transport to use SUNRPC: Use the multipath iterator to assign a transport to each task SUNRPC: Make rpc_clnt store the multipath iterators SUNRPC: Add a structure to track multiple transports SUNRPC: Make freeing of struct xprt rcu-safe SUNRPC: Uninline xprt_get(); It isn't performance critical. SUNRPC: Reorder rpc_task to put waitqueue related info in same cachelines SUNRPC: Remove unused function rpc_task_reset_client
2016-02-22Merge branch 'nfsv41_cb'Trond Myklebust
* nfsv41_cb: NFSv4.x: Fix NFS4ERR_RETRY_UNCACHED_REP in nfs4_callback_sequence NFSv4.x: Allow multiple callbacks in flight NFSv4.x: Fix wraparound issues when validing the callback sequence id NFSv4.x: Enforce the ca_maxresponsesize_cached on the back channel NFSv4.x: CB_SEQUENCE should return NFS4ERR_DELAY if still executing NFSv4.x: Remove hard coded slotids in callback channel
2016-02-22NFSv4.x/pnfs: Fix a race between layoutget and bulk recallsTrond Myklebust
Replace another case where the layout 'plh_block_lgets' can trigger infinite loops in send_layoutget(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-22NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layoutTrond Myklebust
If the server reboots while there is a layoutget outstanding, then the call to pnfs_choose_layoutget_stateid() will fail with an EAGAIN error, which causes an infinite loop in send_layoutget(). The reason why we never break out of the loop is that the layout 'plh_block_lgets' field is never cleared. Fix is to replace plh_block_lgets with NFS_LAYOUT_INVALID_STID, which can be reset after a new layoutget. Fixes: ab7d763e477c5 ("pNFS: Ensure nfs4_layoutget_prepare returns...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-17pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_pageKinglong Mee
unreferenced object 0xffffc90000abf000 (size 16900): comm "fsync02", pid 15765, jiffies 4297431627 (age 423.772s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 a0 c2 19 00 88 ff ff ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8174d54e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811b9b91>] __vmalloc_node_range+0x231/0x280 [<ffffffff811b9c2a>] __vmalloc+0x4a/0x50 [<ffffffffa02c9ec1>] ext_tree_prepare_commit+0x231/0x2e0 [blocklayoutdriver] [<ffffffffa02c700e>] bl_prepare_layoutcommit+0xe/0x10 [blocklayoutdriver] [<ffffffffa0596a6c>] pnfs_layoutcommit_inode+0x29c/0x330 [nfsv4] [<ffffffffa0596b13>] pnfs_generic_sync+0x13/0x20 [nfsv4] [<ffffffffa0585188>] nfs4_file_fsync+0x58/0x150 [nfsv4] [<ffffffff81228e5b>] vfs_fsync_range+0x4b/0xb0 [<ffffffff81228f1d>] do_fsync+0x3d/0x70 [<ffffffff812291d0>] SyS_fsync+0x10/0x20 [<ffffffff81757def>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff v2, add missing include header Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-17nfs4: fix stateid handling for the NFS v4.2 operationsChristoph Hellwig
The newly added NFS v4.2 operations (ALLOCATE, DEALLOCATE, SEEK and CLONE) use a helper called nfs42_set_rw_stateid to select a stateid that is sent to the server. But they don't set the inode and state fields in the nfs4_exception structure, and this don't partake in the stateid recovery protocol. Because of this they will simply return errors insted of trying to recover a stateid when the server return a BAD_STATEID error. Additionally CLONE has the problem that it operates on two files and thus two stateids, and thus needs to call the exception handler twice to recover stateids. While we're at it stop grabbing an addititional reference to the open context in all these operations - having the file open guarantees that the open context won't go away. All this can be produces with the generic/168 and generic/170 tests in xfstests which stress the CLONE stateid handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-17NFSv4: Fix a dentry leak on alias useBenjamin Coddington
In the case where d_add_unique() finds an appropriate alias to use it will have already incremented the reference count. An additional dget() to swap the open context's dentry is unnecessary and will leak a reference. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: 275bb307865a3 ("NFSv4: Move dentry instantiation into the NFSv4-...") Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-15pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomodeTrond Myklebust
When setting the layout return mode, we must always also set the NFS_LAYOUT_RETURN_REQUESTED flag to ensure that we send a layoutreturn. Otherwise pnfs_error_mark_layout_for_return() could set the mode, but fail to send the layoutreturn because another is already in flight. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-15pNFS: Fix pnfs_mark_matching_lsegs_return()Trond Myklebust
We don't need to schedule a layoutreturn if the layout segment can be freed immediately. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05NFS add callback_ops to nfs4_proc_bind_conn_to_session_callbackAndy Adamson
Fix oops when NULL callback_ops pointer accessed in rpc_init_task Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3Trond Myklebust
This adds multipathing to pNFS over NFSv3 as described in the flexfiles draft spec. Ideally, we'd like to do the same for pNFS files, but the NFSv4.1 protocol requires a call to EXCHANGE_ID in order to test that the connection can do session trunking. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connectionsTrond Myklebust
Use the new helper to ensure that nfs4_proc_bind_conn_to_session() is called for all connections. However ensure that we only set the backchannel flag for the connection pointed to by rpc_clnt->cl_xprt. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-01NFSv4.x: Fix NFS4ERR_RETRY_UNCACHED_REP in nfs4_callback_sequenceTrond Myklebust
We need to initialize cb_sequenceres information when reporting a NFS4ERR_RETRY_UNCACHED_REP error, since that will apply to the next operation, not to the CB_SEQUENCE itself. Reported-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-27NFS: Cleanup - rename NFS_LAYOUT_RETURN_BEFORE_CLOSETrond Myklebust
NFS_LAYOUT_RETURN_BEFORE_CLOSE is being used to signal that a layoutreturn is needed, either due to a layout recall or to a layout error. Rename it to NFS_LAYOUT_RETURN_REQUESTED in order to clarify its purpose. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-26pNFS: Fix missing layoutreturn callsTrond Myklebust
The layoutreturn code currently relies on pnfs_put_lseg() to initiate the RPC call when conditions are right. A problem arises when we want to free the layout segment from inside an inode->i_lock section (e.g. in pnfs_clear_request_commit()), since we cannot sleep. The workaround is to move the actual call to pnfs_send_layoutreturn() to pnfs_put_layout_hdr(), which doesn't have this restriction. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-25NFSv4.x: Allow multiple callbacks in flightTrond Myklebust
Hook the callback channel into the same session management machinery as we use for the forward channel. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-24NFSv4.x: Fix wraparound issues when validing the callback sequence idTrond Myklebust
We need to make sure that we don't allow args->csa_sequenceid == 0. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-24NFSv4.x: Enforce the ca_maxresponsesize_cached on the back channelTrond Myklebust
We have no duplicate reply cache, so we always set the back channel ca_maxresponsesize_cached to zero when negotiating the session. That means we should always error out as soon as we see the server set args->csa_cachethis. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-24NFSv4.x: CB_SEQUENCE should return NFS4ERR_DELAY if still executingTrond Myklebust
See RFC5661 Section 2.10.6.2: if retrying a request, and the old one is still in progress, we must return NFS4ERR_DELAY as the reply to sequence. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>