path: root/mm
2024-01-18  Merge tag 'v5.4.267' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.267 stable release.
[PGP signature made Mon 15 Jan 2024 12:25:33 PM EST using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2024-01-15  mm: fix unmap_mapping_range high bits shift bug  (Jiajun Xie)

commit 9eab0421fa94a3dde0d1f7e36ab3294fc306c99d upstream.

The bug happens when the highest bit of holebegin is 1. Suppose holebegin
is 0x8000000111111000: after the shift, hba would be 0xfff8000000111111,
and vma_interval_tree_foreach would then either fail the lookup or return
the wrong result.

Erroneous call sequence, e.g.:

    - mmap(..., offset=0x8000000111111000)
      |- syscall(mmap, ... unsigned long, off):
         |- ksys_mmap_pgoff( ... , off >> PAGE_SHIFT);

Here pgoff is correctly shifted to 0x8000000111111, but passing
0x8000000111111000 as holebegin to unmap then causes a terrible result,
as shown below:

    - unmap_mapping_range(..., loff_t const holebegin)
      |- pgoff_t hba = holebegin >> PAGE_SHIFT;
         /* hba = 0xfff8000000111111 unexpectedly */

The issue arises in heterogeneous computing, where the device (e.g. a
GPU) and the host share the same virtual address space. A simple workflow
pattern which hits the issue is:

        /* host */
    1. userspace first mmaps a file-backed VA range with a specified
       offset, e.g. (offset=0x800..., mmap returns: va_a)
    2. write some data to the corresponding sys page,
       e.g. (va_a = 0xAABB)

        /* device */
    3. gpu workload touches VA, triggers a gpu fault and notifies the
       host.

        /* host */
    4. host receives the gpu fault notification, then it will:
       4.1 unmap host pages and also take care of the cpu tlb
           (use unmap_mapping_range with offset=0x800...)
       4.2 migrate the sys page to the device
       4.3 set up the device page table and resolve the device fault.

        /* device */
    5. gpu workload continues; it accesses va_a and gets 0xAABB.
    6. gpu workload continues; it writes 0xBBCC to va_a.

        /* host */
    7. userspace accesses va_a; as expected, it will:
       7.1 trigger a cpu vm fault.
       7.2 driver handles the fault and migrates the gpu-local page back
           to the host.
    8. userspace can then correctly read 0xBBCC from va_a.
    9. done

But in step 4.1, if we hit the bug this patch describes, userspace will
never trigger the cpu fault, and will still read the old value: 0xAABB.

Making holebegin unsigned first fixes the bug.

Link: https://lkml.kernel.org/r/20231220052839.26970-1-jiajun.xie.sh@gmail.com
Signed-off-by: Jiajun Xie <jiajun.xie.sh@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

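The failure mode is plain C sign extension. A minimal user-space sketch
(not the kernel fix itself; PAGE_SHIFT assumed to be 12) reproduces it:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long ubegin = 0x8000000111111000ULL;
        /* what a signed loff_t holds when bit 63 is set: */
        long long sbegin = (long long)ubegin;

        /* Right-shifting a negative signed value sign-extends on
         * common compilers (implementation-defined in C): */
        printf("signed   >> 12: %llx\n",
               (unsigned long long)(sbegin >> 12)); /* fff8000000111111 */

        /* Shifting the unsigned value yields the real pgoff: */
        printf("unsigned >> 12: %llx\n", ubegin >> 12); /* 8000000111111 */
        return 0;
    }
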
2024-01-15  mm/memory-failure: check the mapcount of the precise page  (Matthew Wilcox (Oracle))

[ Upstream commit c79c5a0a00a9457718056b588f312baadf44e471 ]

A process may map only some of the pages in a folio, and the check might
miss it if it maps the poisoned page but not the head page. Or the
process might be unnecessarily hit if it maps the head page, but not the
poisoned page.

Link: https://lkml.kernel.org/r/20231218135837.3310403-3-willy@infradead.org
Fixes: 7af446a841a2 ("HWPOISON, hugetlb: enable error handling path for hugepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

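The shape of the idea, as a hedged sketch of hwpoison_user_mappings()-style
code (not the verbatim diff): judge "still mapped" by the precise subpage
rather than the compound head.

    /* p is the poisoned subpage, hpage = compound_head(p) */

    /* before: can miss a mapper of the tail, or falsely hit a
     * mapper of the head only: */
    unmap_success = !page_mapped(hpage);

    /* after: ask about the precise page that was poisoned: */
    unmap_success = !page_mapped(p);
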
2023-12-03  Merge tag 'v5.4.262' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.262 stable release.
[PGP signature made Tue 28 Nov 2023 11:50:45 AM EST using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-12-03  Merge tag 'v5.4.261' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.261 stable release.
[PGP signature made Mon 20 Nov 2023 04:30:20 AM EST using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-11-28  mm/cma: use nth_page() in place of direct struct page manipulation  (Zi Yan)

commit 2e7cfe5cd5b6b0b98abf57a3074885979e187c1c upstream.

Patch series "Use nth_page() in place of direct struct page manipulation", v3.

On SPARSEMEM without VMEMMAP, struct page is not guaranteed to be
contiguous, since each memory section's memmap might be allocated
independently. hugetlb pages can go beyond a memory section size, thus
direct struct page manipulation on hugetlb pages/subpages might give the
wrong struct page. The kernel provides nth_page() to do the manipulation
properly. Use it wherever code can see hugetlb pages.

This patch (of 5):

When dealing with hugetlb pages, manipulating struct page pointers
directly can land on the wrong struct page, since struct page is not
guaranteed to be contiguous on SPARSEMEM without VMEMMAP. Use nth_page()
to handle it properly.

Without the fix, page_kasan_tag_reset() could reset the wrong page tags,
causing a wrong kasan result. No related bug is reported. The fix comes
from code inspection.

Link: https://lkml.kernel.org/r/20230913201248.452081-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20230913201248.452081-2-zi.yan@sent.com
Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

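A minimal kernel-style sketch of the difference (illustrative function
name; the real change is cma.c's kasan tag-reset loop):

    /* Reset KASAN tags on each subpage of a possibly
     * section-crossing allocation. */
    static void reset_subpage_tags(struct page *head, unsigned long count)
    {
        unsigned long i;

        for (i = 0; i < count; i++) {
            /* "head + i" assumes the memmap is virtually
             * contiguous - wrong on SPARSEMEM without VMEMMAP.
             * nth_page() resolves via PFN arithmetic instead. */
            page_kasan_tag_reset(nth_page(head, i));
        }
    }
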
2023-11-20  vfs: fix readahead(2) on block devices  (Reuben Hawkins)

[ Upstream commit 7116c0af4b8414b2f19fdb366eea213cbd9d91c2 ]

Readahead was factored to call generic_fadvise. That refactor added an
S_ISREG restriction which broke readahead on block devices.

In addition to S_ISREG, this change checks S_ISBLK to fix block device
readahead. There is no change in behavior with any file type besides
block devices in this change.

Fixes: 3d8f7615319b ("vfs: implement readahead(2) using POSIX_FADV_WILLNEED")
Signed-off-by: Reuben Hawkins <reubenhwk@gmail.com>
Link: https://lore.kernel.org/r/20231003015704.2415-1-reubenhwk@gmail.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

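The shape of the widened check, as a hedged sketch (not the verbatim
diff):

    /* In the readahead(2) path: regular files *and* block devices
     * are valid readahead targets; everything else is rejected. */
    struct inode *inode = file_inode(file);

    if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
        return -EINVAL;
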
2023-09-26  Merge tag 'v5.4.257' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.257 stable release.
[PGP signature made Sat 23 Sep 2023 05:00:19 AM EDT using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-09-23  tmpfs: verify {g,u}id mount options correctly  (Christian Brauner)

[ Upstream commit 0200679fc7953177941e41c2a4241d0b6c2c5de8 ]

A while ago we received the following report:

  "The other outstanding issue I noticed comes from the fact that
   fsconfig syscalls may occur in a different userns than that which
   called fsopen. That means that resolving the uid/gid via
   current_user_ns() can save a kuid that isn't mapped in the associated
   namespace when the filesystem is finally mounted. This means that it
   is possible for an unprivileged user to create files owned by any
   group in a tmpfs mount (since we can set the SUID bit on the tmpfs
   directory), or a tmpfs that is owned by any user, including the root
   group/user."

The contract for {g,u}id mount options, and for {g,u}id values set from
userspace in general, has always been that they are translated according
to the caller's idmapping. In that respect, tmpfs has been doing the
correct thing. But since tmpfs is mountable in unprivileged contexts it
is also necessary to verify that the resulting k{g,u}id is representable
in the namespace of the superblock, to avoid bugs such as the above.

The new mount api's cross-namespace delegation abilities are already
widely used. After having talked to a bunch of userspace, this is the
most faithful solution with minimal regression risk. I know of one user -
systemd - that makes use of the new mount api in this way, and they don't
set unresolvable {g,u}ids. So the regression risk is minimal.

Link: https://lore.kernel.org/lkml/CALxfFW4BXhEwxR0Q5LSkg-8Vb4r2MONKCcUCVioehXQKr35eHg@mail.gmail.com
Fixes: f32356261d44 ("vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API")
Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org>
Reported-by: Seth Jenkins <sethjenkins@google.com>
Message-Id: <20230801-vfs-fs_context-uidgid-v1-1-daf46a050bbf@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

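A hedged sketch of the verification in tmpfs's mount-option parsing
(names approximate, not the exact diff):

    /* Resolve the raw uid in the caller's namespace, as before... */
    kuid_t kuid = make_kuid(current_user_ns(), uid);

    if (!uid_valid(kuid))
        return invalfc(fc, "Invalid uid");

    /* ...and additionally require that it is representable in the
     * namespace the superblock will be associated with, so an
     * unprivileged mounter cannot smuggle in an unmapped owner: */
    if (!kuid_has_mapping(fc->user_ns, kuid))
        return invalfc(fc, "Invalid uid");
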
2023-09-04  Merge tag 'v5.4.255' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.255 stable release.
[PGP signature made Wed 30 Aug 2023 10:28:37 AM EDT using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-08-30  mm: allow a controlled amount of unfairness in the page lock  (Linus Torvalds)

commit 5ef64cc8987a9211d3f3667331ba3411a94ddc79 upstream.

Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.

That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole the
lock from under it.

It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.

Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs to.
But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict ordering
before the lock has been stolen too many times.

There is also a hint from Matthieu Baerts that the fair page coloring may
end up exposing an ABBA deadlock that is hidden by the usual optimistic
lock stealing, and while the unfairness doesn't fix the fundamental issue
(and I'm still looking at that), it avoids it in practice.

The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.

This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain loads. And the main contention
doesn't really seem to be anything related to IO (which was the origin of
this lock), but for things like just verifying that the page file mapping
is stable while faulting in the page into a page table.

Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Tested-by: Maximilian Heyne <mheyne@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

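Conceptually, the waiter side becomes something like the following
(a simplified sketch with hypothetical helper names, not the filemap.c
diff):

    /* Tolerate a bounded number of lock steals before insisting on
     * a strict FIFO handoff of PG_locked. */
    int unfairness = READ_ONCE(sysctl_page_lock_unfairness);

    while (!trylock_page(page)) {       /* opportunistic steal */
        if (unfairness-- <= 0) {
            /* queue as an exclusive waiter: the unlocker hands
             * the lock over directly, no more stealing */
            lock_page_handoff(page);            /* hypothetical */
            break;
        }
        /* non-exclusive wait: sleep, wake, try stealing again */
        wait_for_page_unlocked(page);           /* hypothetical */
    }
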
2023-07-03  Merge tag 'v5.4.249' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.249 stable release.
[PGP signature made Wed 28 Jun 2023 04:20:46 AM EDT using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-06-28  mm: make wait_on_page_writeback() wait for multiple pending writebacks  (Linus Torvalds)

commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.

Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
logic") we've had some very occasional reports of BUG_ON(PageWriteback)
in write_cache_pages(), which we thought we already fixed in commit
073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

But syzbot just reported another one, even with that commit in place.
And it turns out that there's a simpler way to trigger the BUG_ON() than
the one Hugh found with page re-use.

It all boils down to the fact that the page writeback is ostensibly
serialized by the page lock, but that isn't actually really true. Yes,
the people _setting_ writeback all do so under the page lock, but the
actual clearing of the bit - and waking up any waiters - happens without
any page lock.

This gives us this fairly simple race condition:

    CPU1 = end previous writeback
    CPU2 = start new writeback under page lock
    CPU3 = write_cache_pages()

    CPU1            CPU2            CPU3
    ----            ----            ----

    end_page_writeback()
      test_clear_page_writeback(page)
      ... delayed...

                    lock_page();
                    set_page_writeback()
                    unlock_page()

                                    lock_page()
                                    wait_on_page_writeback();

      wake_up_page(page, PG_writeback);
                                    .. wakes up CPU3 ..

                                    BUG_ON(PageWriteback(page));

where the BUG_ON() happens because we woke up the PG_writeback bit
because of the _previous_ writeback, but a new one had already been
started because the clearing of the bit wasn't actually atomic wrt the
actual wakeup or serialized by the page lock.

The reason this didn't use to happen was that the old logic in waiting on
a page bit would just loop if it ever saw the bit set again.

The nice proper fix would probably be to get rid of the whole "wait for
writeback to clear, and then set it" logic in the writeback path, and
replace it with an atomic "wait-to-set" (ie the same as we have for page
locking: we set the page lock bit with a single "lock_page()", not with
"wait for lock bit to clear and then set it").

However, our current model for writeback is that the waiting for the
writeback bit is done by the generic VFS code (ie write_cache_pages()),
but the actual setting of the writeback bit is done much later by the
filesystem ".writepages()" function.

IOW, to make the writeback bit have that same kind of "wait-to-set"
behavior as we have for page locking, we'd have to change our roughly ~50
different writeback functions. Painful.

Instead, just make "wait_on_page_writeback()" loop on the very unlikely
situation that the PG_writeback bit is still set, basically re-instating
the old behavior. This is very non-optimal in case of contention, but
since we only ever set the bit under the page lock, that situation is
controlled.

Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

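The resulting function is tiny; this is its shape (a hedged sketch,
close to the change in mm/page-writeback.c):

    /* Wait for a page to finish writeback - and re-check, since a
     * *new* writeback may start between the wakeup for the old one
     * and this task actually running. */
    void wait_on_page_writeback(struct page *page)
    {
        while (PageWriteback(page)) {
            trace_wait_on_page_writeback(page, page_mapping(page));
            wait_on_page_bit(page, PG_writeback);
        }
    }
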
2023-06-28  mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)  (Hugh Dickins)

commit 073861ed77b6b957c3c8d54a11dc503f7d986ceb upstream.

Twice now, when exercising ext4 looped on shmem huge pages, I have
crashed on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio()
calling end_page_writeback() calling wake_up_page() on tail of a shmem
huge page, no longer an ext4 page at all.

The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.

https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/

Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct
page, lest its memory might itself have already been reused or
hotremoved (and wake_up_page_bit() may modify that memory with its
ClearPageWaiters()).

Then on crashing a second time, realized there's a stronger reason
against that approach. If my testing just occasionally crashes on that
check, when the page is reused for part of a compound page, wouldn't it
be much more common for the page to get reused as an order-0 page before
reaching wake_up_page()? And on rare occasions, might that reused page
already be marked PageWriteback by its new user, and already be waited
upon? What would that look like?

It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).

Matthew Wilcox explaining this to himself:

"page is allocated, added to page cache, dirtied, writeback starts,

--- thread A ---
filesystem calls end_page_writeback()
        test_clear_page_writeback()
--- context switch to thread B ---
truncate_inode_pages_range() finds the page, it doesn't have writeback
set, we delete it from the page cache. Page gets reallocated, dirtied,
writeback starts again. Then we call write_cache_pages(), see
PageWriteback() set, call wait_on_page_writeback()
--- context switch back to thread A ---
wake_up_page(page, PG_writeback);
... thread B is woken, but because the wakeup was for the old use of
the page, PageWriteback is still set.

Devious"

And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.

I have not thought of a fix that does not add a little overhead: the
simplest fix is for end_page_writeback() to get_page() before calling
test_clear_page_writeback(), then put_page() after wake_up_page().

Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared? I think not,
because each waiter does hold a reference on the page. This bug comes
when the old use of the page, the one we do TestClearPageWriteback on,
had *no* waiters, so no additional page reference beyond the page cache
(and whoever racily freed it). The reuse of the page has a waiter
holding a reference, and its own PageWriteback set; but the belated
wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
Reported-by: Qian Cai <cai@lca.pw>
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

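The "simplest fix" described above looks roughly like this in
end_page_writeback() (a hedged sketch; the reclaim-rotation handling at
the top of the real function is omitted):

    void end_page_writeback(struct page *page)
    {
        /* Pin the page so the struct page cannot be freed and
         * reused between clearing the bit and waking waiters: */
        get_page(page);
        test_clear_page_writeback(page);

        smp_mb__after_atomic();
        wake_up_page(page, PG_writeback);
        put_page(page);
    }
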
2023-06-28  list: add "list_del_init_careful()" to go with "list_empty_careful()"  (Linus Torvalds)

[ Upstream commit c6fe44d96fc1536af5b11cd859686453d1b7bfd1 ]

That gives us ordering guarantees around the pair.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 2192bba03d80 ("epoll: ep_autoremove_wake_function should use list_del_init_careful")
Signed-off-by: Sasha Levin <sashal@kernel.org>

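The pair works as a tiny release/acquire protocol; a hedged sketch of
the two helpers (close to the <linux/list.h> versions of that era):

    /* Re-initialize the entry, publishing the "empty" state last,
     * with release semantics: */
    static inline void list_del_init_careful(struct list_head *entry)
    {
        __list_del_entry(entry);
        entry->prev = entry;
        smp_store_release(&entry->next, entry);
    }

    /* A lockless reader that observes "empty" via the acquire load
     * also observes every write made before the deletion: */
    static inline int list_empty_careful(const struct list_head *head)
    {
        struct list_head *next = smp_load_acquire(&head->next);

        return (next == head) && (next == head->prev);
    }
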
2023-06-28  mm: rewrite wait_on_page_bit_common() logic  (Linus Torvalds)

[ Upstream commit 2a9127fcf2296674d58024f83981f40b128fffea ]

It turns out that wait_on_page_bit_common() had several problems, ranging
from simply unfair behavior due to re-queueing at the end of the wait
queue when re-trying, to an outright bug that could result in missed
wakeups (but probably never happened in practice).

This rewrites the whole logic to avoid both issues, by simply moving the
logic to check (and possibly take) the bit lock into the wakeup path
instead.

That makes everything much more straightforward, and means that we never
need to re-queue the wait entry: if we get woken up, we'll be notified
through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
and everything will have been done for us.

Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/
Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 2192bba03d80 ("epoll: ep_autoremove_wake_function should use list_del_init_careful")
Signed-off-by: Sasha Levin <sashal@kernel.org>

2023-06-12  Merge tag 'v5.4.246' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.246 stable release.
[PGP signature made Fri 09 Jun 2023 04:29:16 AM EDT using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-06-09  treewide: Remove uninitialized_var() usage  (Kees Cook)

commit 3f649ab728cda8038259d8f14492fe400fbab911 upstream.

Using uninitialized_var() is dangerous as it papers over real bugs[1] (or
can in the future), and suppresses unrelated compiler warnings (e.g.
"unused variable"). If the compiler thinks it is uninitialized, either
simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
        xargs perl -pi -e \
            's/\buninitialized_var\(([^\)]+)\)/\1/g;
             s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

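For context, the macro being removed was a self-assignment trick; a
sketch of the old definition and what the change replaces it with
(illustrative fragments, not from one compilation unit):

    /* old <linux/compiler*.h> definition - suppresses
     * -Wmaybe-uninitialized without initializing anything: */
    #define uninitialized_var(x) x = x

    int err = err;  /* what uninitialized_var(err) expanded to:
                     * the warning goes away, any real
                     * uninitialized use stays */

    /* preferred after this change, when the compiler cannot prove
     * that all paths assign the variable: */
    int err = 0;
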
2023-05-19  Merge tag 'v5.4.243' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.243 stable release.
[PGP signature made Wed 17 May 2023 05:37:15 AM EDT using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-05-17  mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock  (Tetsuo Handa)

commit 1007843a91909a4995ee78a538f62d8665705b66 upstream.

syzbot is reporting a circular locking dependency which involves the
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.

One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler:

    CPU0
    ----
    __build_all_zonelists() {
      write_seqlock(&zonelist_update_seq);
      // makes zonelist_update_seq.seqcount odd
      // e.g. timer interrupt handler runs at this moment
        some_timer_func() {
          kmalloc(GFP_ATOMIC) {
            __alloc_pages_slowpath() {
              read_seqbegin(&zonelist_update_seq) {
                // spins forever because
                // zonelist_update_seq.seqcount is odd
              }
            }
          }
        }
      // e.g. timer interrupt handler finishes
      write_sequnlock(&zonelist_update_seq);
      // makes zonelist_update_seq.seqcount even
    }

This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.

Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held:

    CPU0                                   CPU1
    ----                                   ----
    pty_write() {
      tty_insert_flip_string_and_push_buffer() {
                                           __build_all_zonelists() {
                                             write_seqlock(&zonelist_update_seq);
                                             build_zonelists() {
                                               printk() {
                                                 vprintk() {
                                                   vprintk_default() {
                                                     vprintk_emit() {
                                                       console_unlock() {
                                                         console_flush_all() {
                                                           console_emit_next_record() {
                                                             con->write() = serial8250_console_write() {
        spin_lock_irqsave(&port->lock, flags);
        tty_insert_flip_string() {
          tty_insert_flip_string_fixed_flag() {
            __tty_buffer_request_room() {
              tty_buffer_alloc() {
                kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                  __alloc_pages_slowpath() {
                    zonelist_iter_begin() {
                      read_seqbegin(&zonelist_update_seq);
                      // spins forever because
                      // zonelist_update_seq.seqcount is odd
                                                               spin_lock_irqsave(&port->lock, flags);
                                                               // spins forever because
                                                               // port->lock is held
                    }
                  }
                }
              }
            }
          }
        }
        spin_unlock_irqrestore(&port->lock, flags);
                                                             }
                                                             // message is printed
                                                             // to console
                                                             spin_unlock_irqrestore(&port->lock, flags);
                                                           }
                                                         }
                                                       }
                                                     }
                                                   }
                                                 }
                                               }
                                             }
                                             write_sequnlock(&zonelist_update_seq);
                                           }
      }
    }

This deadlock scenario can be eliminated by preventing interrupt context
from calling kmalloc(GFP_ATOMIC) and preventing printk() from calling
console_flush_all() while zonelist_update_seq.seqcount is odd.

Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC) and
disabling synchronous printk() in order to avoid console_flush_all().

As a side effect of minimizing the duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced.

Although, from a lockdep perspective, not calling
read_seqbegin(&zonelist_update_seq) from interrupt context at all (i.e.
not recording an unnecessary locking dependency) is still preferable,
even if we don't allow calling kmalloc(GFP_ATOMIC) inside the
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...

Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Patrick Daly <quic_pdaly@quicinc.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

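The fix's shape in __build_all_zonelists() (a hedged sketch of the
mainline-style change; the 5.4 backport may differ in detail):

    unsigned long flags;

    /* Keep the seqcount-odd window free of local interrupts, so no
     * IRQ handler can kmalloc(GFP_ATOMIC) into the read side... */
    local_irq_save(flags);
    write_seqlock(&zonelist_update_seq);
    /* ...and defer console flushing, so printk() from
     * build_zonelists() cannot spin on a console port lock: */
    printk_deferred_enter();

    /* build_zonelists() runs here for the affected nodes */

    printk_deferred_exit();
    write_sequnlock(&zonelist_update_seq);
    local_irq_restore(flags);
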
2023-04-21  Merge tag 'v5.4.241' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.241 stable release.
[PGP signature made Thu 20 Apr 2023 06:07:48 AM EDT using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-04-20  mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()  (Rongwei Wang)

commit 6fe7d6b992113719e96744d974212df3fcddc76c upstream.

The si->lock must be held when deleting the si from the available list.
Otherwise, another thread can re-add the si to the available list, which
can lead to memory corruption. The only place we have found where this
happens is in the swapoff path. This case can be described as below:

    core 0                          core 1
    swapoff

    del_from_avail_list(si)         waiting

    try lock si->lock               acquire swap_avail_lock
                                    and re-add si into
                                    swap_avail_head

    acquire si->lock, but miss that si has already been re-added,
    and continue to clear SWP_WRITEOK, etc.

It can easily be found that massive warning messages can be triggered
inside get_swap_pages() by some special cases, for example, calling
madvise(MADV_PAGEOUT) on blocks of touched memory concurrently while
running many swapon-swapoff operations (e.g. stress-ng-swap).

However, in the worst case, the above scene can cause a panic. In
swapoff(), the memory used by si could be kept in swap_info[] after
turning off a swap. This means the memory corruption will not be caused
immediately, but only once the entry is allocated and reset for a new
swap in the swapon path.

A panic message caused (with CONFIG_PLIST_DEBUG enabled):

    ------------[ cut here ]------------
    top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
    prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
    next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
    WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
    Modules linked in: rfkill(E) crct10dif_ce(E)...
    CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
    Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
    pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
    pc : plist_check_prev_next_node+0x50/0x70
    lr : plist_check_prev_next_node+0x50/0x70
    sp : ffff0018009d3c30
    x29: ffff0018009d3c40 x28: ffff800011b32a98
    x27: 0000000000000000 x26: ffff001803908000
    x25: ffff8000128ea088 x24: ffff800011b32a48
    x23: 0000000000000028 x22: ffff001800875c00
    x21: ffff800010f9e520 x20: ffff001800875c00
    x19: ffff001800fdc6e0 x18: 0000000000000030
    x17: 0000000000000000 x16: 0000000000000000
    x15: 0736076307640766 x14: 0730073007380731
    x13: 0736076307640766 x12: 0730073007380731
    x11: 000000000004058d x10: 0000000085a85b76
    x9 : ffff8000101436e4 x8 : ffff800011c8ce08
    x7 : 0000000000000000 x6 : 0000000000000001
    x5 : ffff0017df9ed338 x4 : 0000000000000001
    x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
    x1 : 0000000000000000 x0 : 0000000000000000
    Call trace:
     plist_check_prev_next_node+0x50/0x70
     plist_check_head+0x80/0xf0
     plist_add+0x28/0x140
     add_to_avail_list+0x9c/0xf0
     _enable_swap_info+0x78/0xb4
     __do_sys_swapon+0x918/0xa10
     __arm64_sys_swapon+0x20/0x30
     el0_svc_common+0x8c/0x220
     do_el0_svc+0x2c/0x90
     el0_svc+0x1c/0x30
     el0_sync_handler+0xa8/0xb0
     el0_sync+0x148/0x180
    irq event stamp: 2082270

Now si->lock is taken before calling del_from_avail_list(), to make sure
other threads see the si being deleted and SWP_WRITEOK being cleared
together, and will not reinsert it again.

This problem exists in versions after stable 5.10.y.

Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba.com
Fixes: a2468cc9bfdff ("swap: choose swap device according to numa node")
Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

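A hedged sketch of the reordering in the swapoff path (mm/swapfile.c
style, not the verbatim diff):

    /* before: del_from_avail_list(si) ran here, without si->lock,
     * leaving a window for get_swap_pages() to re-add si */
    spin_lock(&swap_lock);
    spin_lock(&si->lock);
    del_from_avail_list(si);    /* now under si->lock... */
    si->flags &= ~SWP_WRITEOK;  /* ...so both are observed together */
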
2023-03-17  Merge tag 'v5.4.235' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.235 stable release.
[PGP signature made Sat 11 Mar 2023 10:44:28 AM EST using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-03-11  mm/thp: check and bail out if page in deferred queue already  (Yin Fengwei)

commit 81e506bec9be1eceaf5a2c654e28ba5176ef48d8 upstream.

A kernel build regression with LLVM was reported here:
https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
boundaries"), and commit f35b5d7d676e was reverted.

It turned out the regression is related to madvise(MADV_DONTNEED) being
used by ld.lld, but with a non-PMD_SIZE-aligned len parameter.
trace-bpfcc captured:

    531607  531732  ld.lld  do_madvise.part.0  start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
    531607  531793  ld.lld  do_madvise.part.0  start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4

If the underlying physical page is a THP, madvise(MADV_DONTNEED) can
raise split_queue_lock contention significantly. perf showed the
following data:

    14.85%  0.00%  ld.lld  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
      11.52%
         entry_SYSCALL_64_after_hwframe
         do_syscall_64
         __x64_sys_madvise
         do_madvise.part.0
         zap_page_range
         unmap_single_vma
         unmap_page_range
         page_remove_rmap
         deferred_split_huge_page
         __lock_text_start
         native_queued_spin_lock_slowpath

If the THP can't be removed from the rmap as a whole THP, a partial THP
is removed by removing sub-pages from the rmap. Even when the THP head
page has already been added to the deferred queue, split_queue_lock is
acquired to check whether the THP head page is in the queue. Thus, the
contention on split_queue_lock rises.

Before acquiring split_queue_lock, check and bail out early if the THP
head page is in the queue already. The check without holding
split_queue_lock could race with deferred_split_scan, but it doesn't
impact correctness here.

Test result of building the kernel with ld.lld:

    commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
    $ time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
            6:07.99 real,   26367.77 user,  5063.35 sys

    commit f35b5d7d676e:
    $ time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
            7:22.15 real,   26235.03 user,  12504.55 sys

    commit f35b5d7d676e with the fixing patch:
    $ time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
            6:08.49 real,   26520.15 user,  5047.91 sys

Link: https://lkml.kernel.org/r/20221223135207.2275317-1-fengwei.yin@intel.com
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

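The early bail-out is small; a hedged sketch of
deferred_split_huge_page()-style code (the lockless check is racy by
design, and safe either way):

    void deferred_split_huge_page(struct page *page)
    {
        struct deferred_split *ds_queue = get_deferred_split_queue(page);
        unsigned long flags;

        /* Already queued? Then someone will split it; skip the
         * contended lock. Racing with deferred_split_scan() is
         * fine: either outcome is correct. */
        if (!list_empty(page_deferred_list(page)))
            return;

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        /* ...re-check under the lock and queue the page as before... */
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
    }
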
2023-03-11  mm: memcontrol: deprecate charge moving  (Johannes Weiner)

commit da34a8484d162585e22ed8c1e4114aa2f60e3567 upstream.

Charge moving mode in cgroup1 allows memory to follow tasks as they
migrate between cgroups. This is, and always has been, a questionable
thing to do - for several reasons.

First, it's expensive. Pages need to be identified, locked and isolated
from various MM operations, and reassigned, one by one.

Second, it's unreliable. Once pages are charged to a cgroup, there isn't
always a clear owner task anymore. Cache isn't moved at all, for
example. Mapped memory is moved - but if trylocking or isolating a page
fails, it's arbitrarily left behind. Frequent moving between domains may
leave a task's memory scattered all over the place.

Third, it isn't really needed. Launcher tasks can kick off workload
tasks directly in their target cgroup. Using dedicated per-workload
groups allows fine-grained policy adjustments - no need to move tasks
and their physical pages between control domains. The feature was never
forward-ported to cgroup2, and it hasn't been missed.

Despite it being a niche usecase, the maintenance overhead of supporting
it is enormous. Because pages are moved while they are live and subject
to various MM operations, the synchronization rules are complicated.
There are lock_page_memcg() in MM and FS code, which non-cgroup people
don't understand. In some cases we've been able to shift code and cgroup
API calls around such that we can rely on native locking as much as
possible. But that's fragile, and sometimes we need to hold MM locks for
longer than we otherwise would (pte lock e.g.).

Mark the feature deprecated. Hopefully we can remove it soon.

And backport into -stable kernels so that people who develop against
earlier kernels are warned about this deprecation as early as possible.

[akpm@linux-foundation.org: fix memory.rst underlining]
Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2023-02-27  Merge tag 'v5.4.232' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.232 stable release.
[PGP signature made Wed 22 Feb 2023 06:50:50 AM EST using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-02-22  Revert "mm: Always release pages to the buddy allocator in memblock_free_late()."  (Aaron Thompson)

commit 647037adcad00f2bab8828d3d41cd0553d41f3bd upstream.

This reverts commit 115d9d77bb0f9152c60b6e8646369fa7f6167593.

The pages being freed by memblock_free_late() have already been
initialized, but if they are in the deferred init range,
__free_one_page() might access nearby uninitialized pages when trying to
coalesce buddies. This can, for example, trigger this BUG:

    BUG: unable to handle page fault for address: ffffe964c02580c8
    RIP: 0010:__list_del_entry_valid+0x3f/0x70
    <TASK>
     __free_one_page+0x139/0x410
     __free_pages_ok+0x21d/0x450
     memblock_free_late+0x8c/0xb9
     efi_free_boot_services+0x16b/0x25c
     efi_enter_virtual_mode+0x403/0x446
     start_kernel+0x678/0x714
     secondary_startup_64_no_verify+0xd2/0xdb
    </TASK>

A proper fix will be more involved so revert this change for the time
being.

Fixes: 115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
Signed-off-by: Aaron Thompson <dev@aaront.org>
Link: https://lore.kernel.org/r/20230207082151.1303-1-dev@aaront.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2023-02-22  migrate: hugetlb: check for hugetlb shared PMD in node migration  (Mike Kravetz)

commit 73bdf65ea74857d7fb2ec3067a3cec0e261b1462 upstream.

migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
move pages shared with another process to a different node.
page_mapcount > 1 is being used to determine if a hugetlb page is
shared. However, a hugetlb page will have a mapcount of 1 if mapped by
multiple processes via a shared PMD. As a result, hugetlb pages shared
by multiple processes and mapped with a shared PMD can be moved by a
process without CAP_SYS_NICE.

To fix, check for a shared PMD if mapcount is 1. If a shared PMD is
found, consider the page shared.

Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

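A hedged sketch of the sharedness test (the upstream fix introduces a
hugetlb_pmd_shared()-style helper; details approximate):

    /* A hugetlb page mapped via a shared PMD has mapcount 1, so a
     * bare mapcount check under-reports sharing: */
    static bool hugetlb_page_is_shared(struct page *page, pte_t *ptep)
    {
        if (page_mapcount(page) > 1)
            return true;
        /* mapcount == 1: still shared if the PMD itself is, i.e.
         * several processes' page tables reference the PMD page */
        return hugetlb_pmd_shared(ptep);
    }
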
2023-02-22  mm: swap: properly update readahead statistics in unuse_pte_range()  (Andrea Righi)

commit ebc5951eea499314f6fbbde20e295f1345c67330 upstream.

In unuse_pte_range() we blindly swap in pages without checking if the
swap entry is already present in the swap cache. By doing this, the
hit/miss ratio used by the swap readahead heuristic is not properly
updated, and this leads to non-optimal performance during swapoff.

Tracing the distribution of the readahead size returned by the swap
readahead heuristic during swapoff shows that a small readahead size is
used most of the time, as if we had only misses (this happens both with
cluster and vma readahead), for example:

    r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        36948      $retval = 8
        44151      $retval = 4
        49290      $retval = 1
        527771     $retval = 2

Checking if the swap entry is present in the swap cache, instead, allows
us to properly update the readahead statistics, and the heuristic
behaves in a better way during swapoff, selecting a bigger readahead
size:

    r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        1618       $retval = 1
        4960       $retval = 2
        41315      $retval = 4
        103521     $retval = 8

In terms of swapoff performance the result is the following:

    Testing environment
    ===================
    - Host:
      CPU: 1.8GHz Intel Core i7-8565U (quad-core, 8MB cache)
      HDD: PC401 NVMe SK hynix 512GB
      MEM: 16GB

    - Guest (kvm):
      8GB of RAM
      virtio block driver
      16GB swap file on ext4 (/swapfile)

    Test case
    =========
    - allocate 85% of memory
    - `systemctl hibernate` to force all the pages to be swapped-out
      to the swap file
    - resume the system
    - measure the time that swapoff takes to complete:
      # /usr/bin/time swapoff /swapfile

    Result (swapoff time)
    =====================
                         5.6 vanilla    5.6 w/ this patch
                         -----------    -----------------
    cluster-readahead         22.09s               12.19s
    vma-readahead             18.20s               15.33s

Conclusion
==========

The specific use case this patch is addressing is to improve swapoff
performance in cloud environments when a VM has been hibernated, resumed,
and all the memory needs to be forced back to RAM by disabling swap.

This change allows us to better exploit the advantages of the readahead
heuristic during swapoff, and this improvement allows us to speed up the
resume process of such VMs.

[andrea.righi@canonical.com: update changelog]
Link: http://lkml.kernel.org/r/20200418084705.GA147642@xps-13
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Anchal Agarwal <anchalag@amazon.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Cc: Kelley Nielsen <kelleynnn@gmail.com>
Link: http://lkml.kernel.org/r/20200416180132.GB3352@xps-13
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

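The gist in unuse_pte_range()-style code (a hedged sketch, not the exact
diff):

    /* Consult the swap cache first: a hit both avoids the I/O and
     * lets the readahead heuristic record a hit instead of a
     * spurious miss. */
    page = lookup_swap_cache(entry, vma, addr);
    if (!page) {
        /* a real miss: read it in (the heuristic sees a miss) */
        page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
                                     vma, addr, false);
    }
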
2023-02-22  mm/swapfile: add cond_resched() in get_swap_pages()  (Longlong Xia)

commit 7717fc1a12f88701573f9ed897cc4f6699c661e3 upstream.

The softlockup still occurs in get_swap_pages() under memory pressure:
64 CPU cores, 64GB memory, and 28 zram devices; the disksize of each
zram device is 50MB, with the same priority as si. Use the stress-ng
tool to increase memory pressure, causing the system to oom frequently.

The plist_for_each_entry_safe() loops in get_swap_pages() could reach
tens of thousands of iterations to find available space (extreme case:
cond_resched() is not called in scan_swap_map_slots()). Let's add
cond_resched() into get_swap_pages() when it fails to find available
space, to avoid softlockup.

Link: https://lkml.kernel.org/r/20230128094757.1060525-1-xialonglong1@huawei.com
Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Chen Wandun <chenwandun@huawei.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

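Approximately where the yield lands in get_swap_pages() (a hedged
sketch of the mm/swapfile.c loop):

    plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
                              avail_lists[node]) {
        /* ... scan si for free slots under si->lock ... */
        spin_unlock(&si->lock);
        if (n_ret || size == SWAPFILE_CLUSTER)
            goto check_out;
        /* nothing found on this device: give the scheduler a
         * chance before walking on, so a huge plist walk cannot
         * pin this CPU and trip the soft-lockup detector */
        cond_resched();
        spin_lock(&swap_avail_lock);
    }
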
2023-02-17  Merge tag 'v5.4.231' into v5.4/standard/base  (Bruce Ashfield)

This is the 5.4.231 stable release.
[PGP signature made Mon 06 Feb 2023 01:53:42 AM EST using RSA key
647F28654894E3BD457199BE38DBBDC86092693E; not verified: no public key.
Signature block omitted.]

2023-02-06  panic: Consolidate open-coded panic_on_warn checks  (Kees Cook)

commit 79cc1ba7badf9e7a12af99695a557e9ce27ee967 upstream.

Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their
own warnings, and each check "panic_on_warn". Consolidate this into a
single function so that future instrumentation can be added in a single
location.

Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Gow <davidgow@google.com>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm@kvack.org
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

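The consolidated helper is small; a hedged sketch of the kernel/panic.c
addition:

    /* Single place for the panic_on_warn policy; callers pass who
     * warned (e.g. "KASAN", "UBSAN") for the panic message. */
    void check_panic_on_warn(const char *origin)
    {
        if (panic_on_warn)
            panic("%s: panic_on_warn set ...\n", origin);
    }

Each checker's open-coded "if (panic_on_warn) panic(...)" then becomes a
call such as check_panic_on_warn("KASAN").
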
2023-02-06  mm: kasan: do not panic if both panic_on_warn and kasan_multishot set  (David Gow)

commit be4f1ae978ffe98cc95ec49ceb95386fb4474974 upstream.

KASAN errors will currently trigger a panic when panic_on_warn is set.
This renders kasan_multishot useless, as further KASAN errors won't be
reported if the kernel has already panicked. By making kasan_multishot
disable this behaviour for KASAN errors, we can still have the benefits
of panic_on_warn for non-KASAN warnings, yet be able to use
kasan_multishot.

This is particularly important when running KASAN tests, which need to
trigger multiple KASAN errors: previously these would panic the system
if panic_on_warn was set, now they can run (and will panic the system
should non-KASAN warnings show up).

Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Patricia Alfonso <trishalfonso@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200915035828.570483-6-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-6-davidgow@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

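The shape of the change in KASAN's end-of-report path (a hedged sketch
of mm/kasan/report.c-style code):

    /* Honour panic_on_warn only when multi-shot reporting is off,
     * so a KASAN test run can surface many errors in one boot: */
    if (panic_on_warn &&
        !test_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags)) {
        /* avoid recursing into panic via another WARN() */
        panic_on_warn = 0;
        panic("panic_on_warn set ...\n");
    }
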
2023-02-01 Merge tag 'v5.4.230' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.230 stable release # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmPPeCUACgkQONu9yGCS # aT7d6g/9G6IgNQKClKxhVaIvcumu4QJXxLE8VOKlMaaCu3Ehby3zEuIgwNkM02CO # DXzPmhHC/0KsCw4i/eNANhUndEBtrfl8esPBkTJRYQQV7NvOCEXa+73t68xnGgOY # mRrCGeWdwnpIezaoOWn0IbWSkeW/zAytiBywYbYlOiJ+8CwUFRozviZq/m45kFmg # AYgTMi1XYX+kxBDyhkWnHNttHyUwq0/lUySpvvaSQB8pIP8QFWf28y8lvQZbwmBT # w3CiM/WbG2iGIidf1YYDKN/yLJuWaqZe/SZhOWgH6odle7EgIUCRHTuEcpgENS1D # mYvBHcjeMv0KM2WOJVpspRyUMcfpDe/5ALJyNwqEg1f/U4Egjj/7NYtTYOukak4J # We4x67F0KiKta+CObF3gw/BhCbzI0WI7vuT+jgq+/MTTCeul1NacdCbiZQzHPucy # 2f6OOTrORzgHWJ4yHmEqdOOaDI/3qYBV7bLwZLdTGmCz9KozA9yCfEXufF88XH4l # wrpUsHty5kUMCcgVwEAdea0+xbVtXxQwR5vZ54dt/cIxpBgW60RsFJp4lS5+OC4C # 438hOpon/uupn2miIlLBi91ImXo/dofAUT1BoK/3gK50TGgslCJwOM6NzCvu67he # Ssy59p+VKxUhzRjLOQG+82Ec9mSMGew5Pq0wTNwbsti40F41pnk= # =YlBJ # -----END PGP SIGNATURE----- # gpg: Signature made Tue 24 Jan 2023 01:18:13 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2023-01-25 Merge tag 'v5.4.229' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.229 stable release # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmPHzUQACgkQONu9yGCS # aT7QohAAtxV33qGSKGUdKMZk1JzIYuc8tAa+CHZhTi6xjTsoy1a5MlQGrj8a9YQ7 # /5VvwslGSn29h/ThO/ai04CfeOsWugMtnuo4mT4+198DgH0CNQMlfWq2c25cCvY6 # dIrrMTA7B2YhpdbjM4vkX8QIAxBVCHOVkseSammhMnujP7d+k4LtC6rRV4uiF+lD # cKtsIJn8h+pezBeo5+pjvcTwndaAoApVOES4uOjJcf9pYOOoHxyi+8StpiO+j2Pv # sRvkbvvmpS+IWAH+DMa3SAFI3C3AihX2Fu0rIFzUZByAviB1NmyWluX5mU54wW3R # P80fl0rQFwuygEBU1UqTXe4hQ8YYwpJGAQzbLR22a11IT2MSO+vMRINdqG1un2BE # T9hHix5R0JMeIN9AP7nKGBLrEZ3V6DqxEBz6ZC1sOUIIVQv93twtiwb0rNM0e7pq # PpkIXpwXPIgqFDGXrd0y5ksRT08jJUKCRttuRVWkcGX8adotngWnrl0WBI5zqSuo # B+x8X9Dw7YblJ6yQ+8mAZGk0Mj3j+cb4uhuRaz/6rqHmFOrbHm+JDXvPzZY65xy3 # k8Ebtq5CxINLDwahfb/o13MgbmzMPPNPPp0cz23zOhm88OmwVzB4hAoB/1CfHZvF # XhSbZMVBhhP9hYr2gYl902EQeZGE5yjk5xhFT5Wrh7QoZaPW2XM= # =as6n # -----END PGP SIGNATURE----- # gpg: Signature made Wed 18 Jan 2023 05:43:16 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2023-01-24 mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vma (Hugh Dickins)
commit ab0c3f1251b4670978fde0bd54161795a139b060 upstream. uprobe_write_opcode() uses collapse_pte_mapped_thp() to restore huge pmd, when removing a breakpoint from hugepage text: vma->anon_vma is always set in that case, so undo the prohibition. And MADV_COLLAPSE ought to be able to collapse some page tables in a vma which happens to have anon_vma set from CoWing elsewhere. Is anon_vma lock required? Almost not: if any page other than expected subpage of the non-anon huge page is found in the page table, collapse is aborted without making any change. However, it is possible that an anon page was CoWed from this extent in another mm or vma, in which case a concurrent lookup might look here: so keep it away while clearing pmd (but perhaps we shall go back to using pmd_lock() there in future). Note that collapse_pte_mapped_thp() is exceptional in freeing a page table without having cleared its ptes: I'm uneasy about that, and had thought pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem, and we would have to move the mmu_notification if clearing those ptes. What this fixes is not a dangerous instability. But I suggest Cc stable because uprobes "healing" has regressed in that way, so this should follow 8d3c106e19e8 into those stable releases where it was backported (and may want adjustment there - I'll supply backports as needed). Link: https://lkml.kernel.org/r/b740c9fb-edba-92ba-59fb-7a5592e5dfc@google.com Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
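The prohibition being undone amounts to an early bail-out of this shape; a simplified sketch, with the placement and exit path hypothetical (they vary by kernel version):

    /* mm/khugepaged.c, collapse_pte_mapped_thp() (sketch): the check
     * this patch removes; with it gone, vmas that have anon_vma set
     * (as for uprobes on hugepage text) can be collapsed again. */
    if (vma->anon_vma)
        return;    /* hypothetical exit; the real path unwinds state */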
2023-01-18 mm: Always release pages to the buddy allocator in memblock_free_late(). (Aaron Thompson)
commit 115d9d77bb0f9152c60b6e8646369fa7f6167593 upstream. If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages() only releases pages to the buddy allocator if they are not in the deferred range. This is correct for free pages (as defined by for_each_free_mem_pfn_range_in_zone()) because free pages in the deferred range will be initialized and released as part of the deferred init process. memblock_free_pages() is called by memblock_free_late(), which is used to free reserved ranges after memblock_free_all() has run. All pages in reserved ranges have been initialized at that point, and accordingly, those pages are not touched by the deferred init process. This means that currently, if the pages that memblock_free_late() intends to release are in the deferred range, they will never be released to the buddy allocator. They will forever be reserved. In addition, memblock_free_pages() calls kmsan_memblock_free_pages(), which is also correct for free pages but is not correct for reserved pages. KMSAN metadata for reserved pages is initialized by kmsan_init_shadow(), which runs shortly before memblock_free_all(). For both of these reasons, memblock_free_pages() should only be called for free pages, and memblock_free_late() should call __free_pages_core() directly instead. One case where this issue can occur in the wild is EFI boot on x86_64. The x86 EFI code reserves all EFI boot services memory ranges via memblock_reserve() and frees them later via memblock_free_late() (efi_reserve_boot_services() and efi_free_boot_services(), respectively). If any of those ranges happens to fall within the deferred init range, the pages will not be released and that memory will be unavailable. For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: v6.2-rc2: # grep -E 'Node|spanned|present|managed' /proc/zoneinfo Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 178867 v6.2-rc2 + patch: # grep -E 'Node|spanned|present|managed' /proc/zoneinfo Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 222816 # +43,949 pages Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set") Signed-off-by: Aaron Thompson <dev@aaront.org> Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1ba989-000000@us-west-2.amazonses.com Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
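A sketch of the fix in memblock_free_late(), following the commit's description (loop context simplified):

    /* mm/memblock.c (sketch): pages freed here are already initialized,
     * so release them straight to the buddy allocator instead of going
     * through memblock_free_pages() and its deferred-init check. */
    for (; cursor < end; cursor++) {
        __free_pages_core(pfn_to_page(cursor), 0);
        totalram_pages_inc();
    }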
2023-01-18 mm, compaction: fix fast_isolate_around() to stay within boundaries (NARIBAYASHI Akira)
commit be21b32afe470c5ae98e27e49201158a47032942 upstream. Depending on the memory configuration, isolate_freepages_block() may scan pages out of the target range and cause a panic. The panic can occur on systems with multiple zones in a single pageblock; it is rare because it only happens in special configurations. Depending on how many similar systems there are, it may be a good idea to fix this problem for older kernels as well. The problem is that the pfn passed to fast_isolate_around() can lie outside the target range, so we must consider both the case where pfn < start_pfn and the case where end_pfn < pfn. This should have been addressed by commit 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone"), but there was an oversight. Case1: pfn < start_pfn. During compaction for node Y, pfn can land in node X's zone, below start_pfn (= cc->zone->zone_start_pfn), so the range scanned by "Scan After" crosses the zone boundary. Case2: end_pfn < pfn. During compaction for node X, pfn can land in node Y's zone, beyond end_pfn, so the range scanned by "Scan Before" crosses the zone boundary. (The original commit message illustrates both cases with pageblock diagrams.) There seems to be no good reason to skip nr_isolated pages just after the given pfn, so perform a simple scan from start to end instead of dividing the scan into "Before" and "After". Link: https://lkml.kernel.org/r/20221026112438.236336-1-a.naribayashi@fujitsu.com Fixes: 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone") Signed-off-by: NARIBAYASHI Akira <a.naribayashi@fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
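The essence of the fix is clamping the scan window to the current zone before a single scan; a sketch with variable names assumed from fast_isolate_around():

    /* mm/compaction.c, fast_isolate_around() (sketch): never let the
     * [start_pfn, end_pfn) window leave cc->zone, even when the
     * pageblock straddles a zone boundary. */
    start_pfn = max(pageblock_start_pfn(pfn), cc->zone->zone_start_pfn);
    end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone));
    isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false);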
2022-12-19 Merge tag 'v5.4.228' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.228 stable release # gpg: Signature made Mon 19 Dec 2022 06:25:14 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-12-19 Merge tag 'v5.4.227' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.227 stable release # gpg: Signature made Wed 14 Dec 2022 05:30:54 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-12-19 Merge tag 'v5.4.226' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.226 stable release # gpg: Signature made Thu 08 Dec 2022 05:23:11 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-12-19 mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page (Baolin Wang)
commit fac35ba763ed07ba93154c95ffc0c4a55023707f upstream. Some architectures (like ARM64) can support CONT-PTE/PMD size hugetlb, which means they can support not only PMD/PUD size hugetlb (2M and 1G), but also CONT-PTE/PMD size (64K and 32M) if a 4K page size is specified. So when looking up a CONT-PTE size hugetlb page by follow_page(), it will use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size hugetlb in follow_page_pte(). However, this pte entry lock is incorrect for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get the correct lock, which is mm->page_table_lock. That means the pte entry of the CONT-PTE size hugetlb under the current pte lock is unstable in follow_page_pte(), and we can continue to migrate or poison the pte entry of the CONT-PTE size hugetlb, which can cause some potential race issues, even though they are under the 'pte lock'. For example, suppose thread A is trying to look up a CONT-PTE size hugetlb page by the move_pages() syscall under the lock, while another thread B migrates the CONT-PTE hugetlb page at the same time; this will cause thread A to get an incorrect page, and if thread A also wants to do page migration, a data inconsistency error occurs. Moreover, we have the same issue for CONT-PMD size hugetlb in follow_huge_pmd(). To fix the above issues, rename follow_huge_pmd() to follow_huge_pmd_pte() to handle PMD and PTE level size hugetlb, using huge_pte_lock() to get the correct pte entry lock and make the pte entry stable. Mike said: Support for CONT_PMD/_PTE was added with bb9dd3df8ee9 ("arm64: hugetlb: refactor find_num_contig()"). Patch series "Support for contiguous pte hugepages", v4. However, I do not believe these code paths were executed until migration support was added with 5480280d3f2d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages") I would go with 5480280d3f2d for the Fixes: target. Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.1662017562.git.baolin.wang@linux.alibaba.com Fixes: 5480280d3f2d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> [5.4: Fixup contextual diffs before pin_user_pages()] Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
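A sketch of the corrected locking per the commit's description; the renamed helper is follow_huge_pmd_pte(), with the surrounding context simplified:

    /* mm/hugetlb.c, follow_huge_pmd_pte() (sketch): huge_pte_lock()
     * picks the right lock (mm->page_table_lock for CONT-PTE/PMD size
     * hugetlb), keeping the entry stable while it is examined. */
    ptep = huge_pte_offset(mm, address, huge_page_size(h));
    ptl = huge_pte_lock(h, mm, ptep);
    /* ... re-check the entry and take a page reference under ptl ... */
    spin_unlock(ptl);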
2022-12-14 mm/gup: fix gup_pud_range() for dax (John Starks)
commit fcd0ccd836ffad73d98a66f6fea7b16f735ea920 upstream. For dax pud, pud_huge() returns true on x86. So the function works as long as hugetlb is configured. However, dax doesn't depend on hugetlb. Commit 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") fixed devmap-backed huge PMDs, but missed devmap-backed huge PUDs. Fix this as well. This fixes the below kernel panic: general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP < snip > Call Trace: <TASK> get_user_pages_fast+0x1f/0x40 iov_iter_get_pages+0xc6/0x3b0 ? mempool_alloc+0x5d/0x170 bio_iov_iter_get_pages+0x82/0x4e0 ? bvec_alloc+0x91/0xc0 ? bio_alloc_bioset+0x19a/0x2a0 blkdev_direct_IO+0x282/0x480 ? __io_complete_rw_common+0xc0/0xc0 ? filemap_range_has_page+0x82/0xc0 generic_file_direct_write+0x9d/0x1a0 ? inode_update_time+0x24/0x30 __generic_file_write_iter+0xbd/0x1e0 blkdev_write_iter+0xb4/0x150 ? io_import_iovec+0x8d/0x340 io_write+0xf9/0x300 io_issue_sqe+0x3c3/0x1d30 ? sysvec_reschedule_ipi+0x6c/0x80 __io_queue_sqe+0x33/0x240 ? fget+0x76/0xa0 io_submit_sqes+0xe6a/0x18d0 ? __fget_light+0xd1/0x100 __x64_sys_io_uring_enter+0x199/0x880 ? __context_tracking_enter+0x1f/0x70 ? irqentry_exit_to_user_mode+0x24/0x30 ? irqentry_exit+0x1d/0x30 ? __context_tracking_exit+0xe/0x70 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7fc97c11a7be < snip > </TASK> ---[ end trace 48b2e0e67debcaeb ]--- RIP: 0010:internal_get_user_pages_fast+0x340/0x990 < snip > Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: https://lkml.kernel.org/r/1670392853-28252-1-git-send-email-ssengar@linux.microsoft.com Fixes: 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") Signed-off-by: John Starks <jostarks@microsoft.com> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
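The shape of the fix mirrors the earlier pmd one; a sketch, assuming the 5.4 GUP layout:

    /* mm/gup.c (sketch): a devmap-backed huge PUD must take the
     * huge-PUD path too; pud_huge() alone only covers hugetlb. */
    if (unlikely(pud_huge(pud) || pud_devmap(pud))) {
        /* ... handle as a huge PUD instead of descending to pmds ... */
    }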
2022-12-14 memcg: fix possible use-after-free in memcg_write_event_control() (Tejun Heo)
commit 4a7ba45b1a435e7097ca0f79a847d0949d0eb088 upstream. memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it, allowing any file to slip through. With the invariants broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-frees. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
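A sketch of the resurrected check using the superblock/dentry test the commit describes, with variable names assumed from memcg_write_event_control():

    /* mm/memcontrol.c (sketch): only a regular file on cgroupfs may be
     * the control fd; anything else can be renamed or removed under
     * us, making the d_name/parent accesses use-after-free prone. */
    cdentry = cfile.file->f_path.dentry;
    if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
        ret = -EINVAL;
        goto out_put_cfile;
    }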
2022-12-14 mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths (Jann Horn)
commit f268f6cf875f3220afc77bdd0bf1bb136eb54db9 upstream. Any codepath that zaps page table entries must invoke MMU notifiers to ensure that secondary MMUs (like KVM) don't keep accessing pages which aren't mapped anymore. Secondary MMUs don't hold their own references to pages that are mirrored over, so failing to notify them can lead to page use-after-free. I'm marking this as addressing an issue introduced in commit f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of the security impact of this only came in commit 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP"), which actually omitted flushes for the removal of present PTEs, not just for the removal of empty page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [manual backport: this code was refactored from two copies into a common helper between 5.15 and 6.0] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
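A sketch of the required notifier bracketing around the zap, with the range setup simplified and constants as in the mainline API:

    /* mm/khugepaged.c (sketch): tell secondary MMUs such as KVM before
     * and after the mapping is torn down so they drop mirrored pages. */
    struct mmu_notifier_range range;

    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
                            addr, addr + HPAGE_PMD_SIZE);
    mmu_notifier_invalidate_range_start(&range);
    /* ... clear the ptes / retract the page table ... */
    mmu_notifier_invalidate_range_end(&range);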
2022-12-14 mm/khugepaged: fix GUP-fast interaction by sending IPI (Jann Horn)
commit 2ba99c5e08812494bc57f319fb562f527d9bacd8 upstream. Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to ensure that the page table was not removed by khugepaged in between. However, lockless_pages_from_mm() still requires that the page table is not concurrently freed. Fix it by sending IPIs (if the architecture uses semi-RCU-style page table freeing) before freeing/reusing page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com Fixes: ba76149f47d8 ("thp: khugepaged") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [manual backport: two of the three places in khugepaged that can free ptes were refactored into a common helper between 5.15 and 6.0; TLB flushing was refactored between 5.4 and 5.10] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
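The upstream fix adds a small synchronization helper; a sketch of the helper and its use (name from the upstream commit, body simplified):

    /* mm/mmu_gather.c (sketch): IPI every CPU; once this returns, no
     * CPU can still be inside a GUP-fast walk of the detached table. */
    static void tlb_remove_table_smp_sync(void *arg)
    {
        /* the interrupt itself is the synchronization; nothing to do */
    }

    void tlb_remove_table_sync_one(void)
    {
        smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
    }

khugepaged then calls tlb_remove_table_sync_one() before freeing or reusing a page table it has detached.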
2022-12-14 mm/khugepaged: take the right locks for page table retraction (Jann Horn)
commit 8d3c106e19e8d251da31ff4cc7462e4565d65084 upstream. pagetable walks on address ranges mapped by VMAs can be done under the mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the VMA's address_space. Only one of these needs to be held, and it does not need to be held in exclusive mode. Under those circumstances, the rules for concurrent access to page table entries are: - Terminal page table entries (entries that don't point to another page table) can be arbitrarily changed under the page table lock, with the exception that they always need to be consistent for hardware page table walks and lockless_pages_from_mm(). This includes that they can be changed into non-terminal entries. - Non-terminal page table entries (which point to another page table) can not be modified; readers are allowed to READ_ONCE() an entry, verify that it is non-terminal, and then assume that its value will stay as-is. Retracting a page table involves modifying a non-terminal entry, so page-table-level locks are insufficient to protect against concurrent page table traversal; it requires taking all the higher-level locks under which it is possible to start a page walk in the relevant range in exclusive mode. The collapse_huge_page() path for anonymous THP already follows this rule, but the shmem/file THP path was getting it wrong, making it possible for concurrent rmap-based operations to cause corruption. Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [manual backport: this code was refactored from two copies into a common helper between 5.15 and 6.0] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
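For the shmem/file path, following these rules means holding the address_space lock exclusively around the retraction; a sketch with simplified context:

    /* mm/khugepaged.c, file/shmem retraction (sketch): hold the i_mmap
     * rwsem in write mode while the non-terminal pmd entry is changed,
     * so no new page walk can start in this range. */
    i_mmap_lock_write(mapping);
    pmd = pmdp_collapse_flush(vma, addr, pmdp);
    i_mmap_unlock_write(mapping);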
2022-12-08 v4l2: don't fall back to follow_pfn() if pin_user_pages_fast() fails (Linus Torvalds)
commit 6647e76ab623b2b3fb2efe03a86e9c9046c52c33 upstream. The V4L2_MEMORY_USERPTR interface is long deprecated and shouldn't be used (and is discouraged for any modern v4l drivers). And Seth Jenkins points out that the fallback to VM_PFNMAP/VM_IO is fundamentally racy and dangerous. Note that it's not even a case that should trigger, since any normal user pointer logic ends up just using the pin_user_pages_fast() call that does the proper page reference counting. That's not the problem case, only if you try to use special device mappings do you have any issues. Normally I'd just remove this during the merge window, but since Seth pointed out the problem cases, we really want to know as soon as possible if there are actually any users of this odd special case of a legacy interface. Neither Hans nor Mauro seem to think that such mis-uses of the old legacy interface should exist. As Mauro says: "See, V4L2 has actually 4 streaming APIs: - Kernel-allocated mmap (usually referred simply as just mmap); - USERPTR mmap; - read(); - dmabuf; The USERPTR is one of the oldest way to use it, coming from V4L version 1 times, and by far the least used one" And Hans chimed in on the USERPTR interface: "To be honest, I wouldn't mind if it goes away completely, but that's a bit of a pipe dream right now" but while removing this legacy interface entirely may be a pipe dream we can at least try to remove the unlikely (and actively broken) case of using special device mappings for USERPTR accesses. This replaces it with a WARN_ONCE() that we can remove once we've hopefully confirmed that no actual users exist. NOTE! Longer term, this means that a 'struct frame_vector' only ever contains proper page pointers, and all the games we have with converting them to pages can go away (grep for 'frame_vector_to_pages()' and the uses of 'vec->is_pfns'). But this is just the first step, to verify that this code really is all dead, and do so as quickly as possible. Reported-by: Seth Jenkins <sethjenkins@google.com> Acked-by: Hans Verkuil <hverkuil@xs4all.nl> Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
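A sketch of the replacement in get_vaddr_frames(); the surrounding error handling here is an assumption:

    /* mm/frame_vector.c (sketch): no more follow_pfn() fallback for
     * VM_IO/VM_PFNMAP vmas; warn once so any remaining users surface. */
    WARN_ONCE(1, "get_vaddr_frames() cannot follow VM_IO memory\n");
    vec->nr_frames = 0;
    return ret ? ret : -EFAULT;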
2022-11-30 Merge tag 'v5.4.225' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.225 stable release # gpg: Signature made Fri 25 Nov 2022 11:43:12 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-11-25 mm: fs: initialize fsdata passed to write_begin/write_end interface (Alexander Potapenko)
commit 1468c6f4558b1bcd92aa0400f2920f9dc7588402 upstream. Functions implementing the a_ops->write_end() interface accept the `void *fsdata` parameter that is supposed to be initialized by the corresponding a_ops->write_begin() (which accepts `void **fsdata`). However not all a_ops->write_begin() implementations initialize `fsdata` unconditionally, so it may get passed uninitialized to a_ops->write_end(), resulting in undefined behavior. Fix this by initializing fsdata with NULL before the call to write_begin(), rather than doing so in all possible a_ops implementations. This patch covers only the following cases found by running x86 KMSAN under syzkaller: - generic_perform_write() - cont_expand_zero() and generic_cont_expand_simple() - page_symlink() Other cases of passing uninitialized fsdata may persist in the codebase. Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
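A sketch of the pattern in one affected caller, using the 5.4-era write_begin/write_end signatures:

    /* generic_perform_write() (sketch): NULL-init fsdata so a
     * write_begin() that never touches it cannot leak an uninitialized
     * pointer into write_end(). */
    struct page *page;
    void *fsdata = NULL;    /* previously left uninitialized */

    status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                &page, &fsdata);
    if (unlikely(status < 0))
        break;
    /* ... copy the user data into the page ... */
    status = a_ops->write_end(file, mapping, pos, bytes, copied,
                              page, fsdata);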