[6.18] Track gaming patches #44
base: base-6.18
Conversation
Contains:
- PCI: Add Intel remapped NVMe device support
Consumer products that are configured by default to run the Intel SATA AHCI
controller in "RAID" or "Intel RST Premium With Intel Optane System
Acceleration" mode are becoming increasingly prevalent.
Under this mode, NVMe devices are remapped into the SATA device and become
hidden from the PCI bus, which means that Linux users cannot access their
storage devices unless they go into the firmware setup menu to revert back
to AHCI mode - assuming such an option is available. Lack of support for this
mode is also causing complications for vendors who distribute Linux.
Add support for the remapped NVMe mode by creating a virtual PCI bus,
where the AHCI and NVMe devices are presented separately, allowing the
ahci and nvme drivers to bind in the normal way.
Unfortunately the NVMe device configuration space is inaccessible under
this scheme, so we provide a fake one, and hope that no DeviceID-based
quirks are needed. The interrupt is shared between the AHCI and NVMe
devices.
Allow pci_real_dma_dev() to traverse back to the real DMA device from
the PCI devices created on our virtual bus, in case the IOMMU driver
is involved with data transfers here (a rough sketch follows after this list).
The existing ahci driver is modified to not claim devices where remapped
NVMe devices are present, allowing this new driver to step in.
The details of the remapping scheme came from patches previously
posted by Dan Williams and the resulting discussion.
https://phabricator.endlessm.com/T24358
https://phabricator.endlessm.com/T29119
Signed-off-by: Daniel Drake <drake@endlessm.com>
- PCI: Fix order of remapped NVMe devices
Signed-off-by: Kai Krakow <kai@kaishome.de>
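For context on the pci_real_dma_dev() hook mentioned in the first patch above, here is a rough sketch of the idea, not the actual diff; intel_nvme_remap_bus() is a hypothetical helper name, used only to illustrate how a device on the virtual bus could be traced back to the physical RST/AHCI controller that really issues the DMA:

#include <linux/pci.h>

struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
{
	/* Devices on the virtual bus never issue DMA themselves; the parent
	 * AHCI/RST controller does, so the IOMMU must be programmed for it. */
	if (intel_nvme_remap_bus(dev->bus))		/* hypothetical helper */
		return to_pci_dev(dev->bus->bridge);	/* real RST controller */

	return dev;	/* default: the device is its own DMA device */
}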
There's plenty of room on the stack for a few more inlined bytes here and there. The measured stack usage at runtime is still safe without this, and performance is surely improved at a microscopic level, so remove it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
From ClearLinux's own patches, disable both AVX2 and tree vectorization when using O3 and higher than generic amd64 architectures.
Source: https://github.com/clearlinux-pkgs/linux/blob/main/0133-novector.patch
Signed-off-by: Kai Krakow <kai@kaishome.de>
ATA init is the long pole in the boot process, and it's asynchronous. Move the graphics init after it so that ATA and graphics initialize in parallel.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Significant time was spent on synchronize_rcu in evdev_detach_client
when applications closed evdev devices. Switching VT away from a
graphical environment commonly leads to mass input device closures,
which could lead to noticeable delays on systems with many input devices.
Replace synchronize_rcu with call_rcu, deferring reclaim of the evdev
client struct till after the RCU grace period instead of blocking the
calling application.
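As a rough illustration of that change, simplified from drivers/input/evdev.c rather than the exact hunk, the detach path queues an RCU callback instead of blocking:

/* Simplified sketch; the real evdev_client has many more fields and
 * struct evdev is defined elsewhere in evdev.c. */
struct evdev_client {
	struct list_head node;
	struct rcu_head rcu;	/* new: allows deferred freeing via call_rcu() */
	/* ... buffer, fasync, evdev pointer, etc. ... */
};

/* RCU callback: runs after the grace period, off the close() path. */
static void evdev_reclaim_client(struct rcu_head *rp)
{
	struct evdev_client *client = container_of(rp, struct evdev_client, rcu);

	kvfree(client);
}

/* Detach no longer blocks in synchronize_rcu(); reclaim is deferred. */
static void evdev_detach_client(struct evdev *evdev, struct evdev_client *client)
{
	spin_lock(&evdev->client_lock);
	list_del_rcu(&client->node);
	spin_unlock(&evdev->client_lock);
	call_rcu(&client->rcu, evdev_reclaim_client);
}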
While this does not solve all slow evdev fd closures, it takes care of a
good portion of them, including this simple test:
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int idx, fd;
	const char *path = "/dev/input/event0";

	for (idx = 0; idx < 1000; idx++) {
		if ((fd = open(path, O_RDWR)) == -1) {
			return -1;
		}
		close(fd);
	}
	return 0;
}
Time to completion of above test when run locally:
Before: 0m27.111s
After: 0m0.018s
Signed-off-by: Kenny Levinsen <kl@kl.wtf>
Per [Fedora][1], they intend to change the default max map count for their distribution to improve OOTB compatibility with games played through Steam/Proton. The value they picked comes from the Steam Deck, which defaults to INT_MAX - MAPCOUNT_ELF_CORE_MARGIN. Since most ZEN and Liquorix users probably play games, follow Valve's lead and raise this value to their default.
[1]: https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount
Signed-off-by: Kai Krakow <kai@kaishome.de>
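As a rough sketch of what that amounts to in include/linux/mm.h, assuming the mainline DEFAULT_MAX_MAP_COUNT definition (the actual hunk may be gated behind a config option):

#define MAPCOUNT_ELF_CORE_MARGIN	(5)
/* mainline: (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN), i.e. 65530 */
#define DEFAULT_MAX_MAP_COUNT		(INT_MAX - MAPCOUNT_ELF_CORE_MARGIN)

sysctl_max_map_count (vm.max_map_count) is initialized from DEFAULT_MAX_MAP_COUNT, so the runtime default becomes 2147483642.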
Contains:
- mm: Stop kswapd early when nothing's waiting for it to free pages
Keeping kswapd running when all the failed allocations that invoked it
are satisfied incurs a high overhead due to unnecessary page eviction
and writeback, as well as spurious VM pressure events to various
registered shrinkers. When kswapd doesn't need to work to make an
allocation succeed anymore, stop it prematurely to save resources.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
- mm: Don't stop kswapd on a per-node basis when there are no waiters
The page allocator wakes all kswapds in an allocation context's allowed
nodemask in the slow path, so it doesn't make sense to have the kswapd-
waiter count per NUMA node. Instead, it should be a global counter
to stop all kswapds when there are no failed allocation requests.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
- mm: Increment kswapd_waiters for throttled direct reclaimers
Throttled direct reclaimers will wake up kswapd and wait for kswapd to
satisfy their page allocation request, even when the failed allocation
lacks the __GFP_KSWAPD_RECLAIM flag in its gfp mask. As a result, kswapd
may think that there are no waiters and thus exit prematurely, causing
throttled direct reclaimers lacking __GFP_KSWAPD_RECLAIM to stall on
waiting for kswapd to wake them up. Incrementing the kswapd_waiters
counter when such direct reclaimers become throttled fixes the problem.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Kai Krakow <kai@kaishome.de>
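A rough sketch of the waiter-counter idea running through the three patches above, with names and hook points simplified, not the actual diff:

#include <linux/atomic.h>

static atomic_long_t kswapd_waiters = ATOMIC_LONG_INIT(0);

/* Allocator slow path (including throttled direct reclaimers): register as a
 * waiter while kswapd works on our behalf. */
static inline void kswapd_waiter_inc(void)
{
	atomic_long_inc(&kswapd_waiters);
}

static inline void kswapd_waiter_dec(void)
{
	atomic_long_dec(&kswapd_waiters);
}

/* Checked by kswapd between reclaim batches: once nobody is waiting on a
 * failed allocation, stop early instead of evicting more pages. */
static inline bool kswapd_no_waiters(void)
{
	return atomic_long_read(&kswapd_waiters) == 0;
}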
This patch disables the staggered spinup used for HDDs. The goal is to make boot times faster, with the small downside of a small spike in power consumption. Systems with a bunch of HDDs would see considerably faster boots. This does make sense in the zen kernel as it's supposed to be a kernel specialized for desktop performance, and faster boot times fit that description.
Signed-off-by: Pedro Montes Alcalde <pedro.montes.alcalde@gmail.com>
Signed-off-by: Kai Krakow <kai@kaishome.de>
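A minimal sketch of one way to achieve this, assuming it reuses libahci's existing ignore_sss module parameter; the actual patch in this series may implement it differently:

/* drivers/ata/libahci.c - sketch: flip the default so the staggered spinup
 * flag is ignored and all ports spin up in parallel (mainline default is 0). */
static int ahci_ignore_sss = 1;
module_param_named(ignore_sss, ahci_ignore_sss, int, 0444);
MODULE_PARM_DESC(ignore_sss, "Ignore staggered spinup flag (0=don't ignore, 1=ignore)");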
Signed-off-by: Kai Krakow <kai@kaishome.de>
In case of a multi-queue device, the code pointlessly loaded the default elevator just to drop it again.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Fall back straight to none instead of mq-deadline. Some benchmarks in a [recent paper][1] suggest that mq-deadline has too much lock contention, hurting throughput and eating CPU waiting for spinlocks.
[1]: https://research.spec.org/icpe_proceedings/2024/proceedings/p154.pdf
Signed-off-by: Kai Krakow <kai@kaishome.de>
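A rough sketch of the fallback change in block/elevator.c; helper names vary between kernel versions, so this is not the exact hunk:

static struct elevator_type *elevator_get_default(struct request_queue *q)
{
	/* Previously this returned a reference to "mq-deadline" for
	 * single-queue devices. Returning NULL means "none": no I/O
	 * scheduler is attached by default. */
	return NULL;
}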
Use [defer+madvise] as default khugepaged defrag strategy: For some reason, the default strategy to respond to THP fault fallbacks is still just madvise, meaning stall if the program wants transparent hugepages, but don't trigger a background reclaim / compaction if THP begins to fail allocations. This creates a snowball effect where we still use the THP code paths, but we almost always fail once a system has been active and busy for a while.
The option "defer" was created for interactive systems where THP can still improve performance. If we have to fall back to a regular page due to an allocation failure or anything else, we will trigger a background reclaim and compaction so future THP attempts succeed and previous attempts eventually have their smaller pages combined without stalling running applications. We still want madvise to stall applications that explicitly want THP, so defer+madvise _does_ make a ton of sense. Make it the default for interactive systems, especially if the kernel maintainer left transparent hugepages on "always".
Reasoning and details in the original patch: https://lwn.net/Articles/711248/
Signed-off-by: Kai Krakow <kai@kaishome.de>
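At runtime this corresponds to writing defer+madvise to /sys/kernel/mm/transparent_hugepage/defrag; as a built-in default it roughly amounts to the following in mm/huge_memory.c (sketch only; the real patch is likely gated behind a config option such as CONFIG_ZEN_INTERACTIVE and keeps the other default flags intact):

unsigned long transparent_hugepage_flags __read_mostly =
	(1 << TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG) |	/* "defer+madvise" */
	(1 << TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);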
5.7: Take "sysctl_sched_nr_migrate" tune from early XanMod builds of 128. As of 5.7, XanMod uses 256 but that may affect applications that require timely response to IRQs.
5.15: Per [a comment][1] on our ZEN INTERACTIVE commit, reducing the cost of migration makes the system less responsive under high load. Most likely the combination of reduced migration cost + the higher number of tasks that can be migrated at once contributes to this. To better handle this situation, restore the mainline migration cost value and also reduce the max number of tasks that can be migrated in batch from 128 to 64. If this doesn't help, we'll restore the reduced migration cost and keep the total number of tasks that can be migrated at once at 32.
[1]: zen-kernel@be5ba23#commitcomment-63159674
6.6: Port the tuning to EEVDF, which removed a couple of settings.
6.7: Instead of increasing the number of tasks that migrate at once, migrate the amount acceptable for PREEMPT_RT, but reduce the cost so migrations occur more often. This should make CFS/EEVDF behave more like out-of-tree schedulers that aggressively use idle cores to reduce latency, but without the jank caused by rebalancing too many tasks at once.
Signed-off-by: Kai Krakow <kai@kaishome.de>
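The two knobs involved are sysctl_sched_nr_migrate and sysctl_sched_migration_cost. A sketch of the 6.7 direction described above; the migration-cost number below is a placeholder, not the value actually used in the patch:

/* kernel/sched/: mainline defaults are 32 (8 on PREEMPT_RT) and 500000 ns */
unsigned int sysctl_sched_nr_migrate = 8;
/* placeholder value; the patch only states that the cost is reduced */
unsigned int sysctl_sched_migration_cost = 250000UL;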
4.10: During some personal testing with the Dolphin emulator, MuQSS has serious problems scaling its frequencies, causing poor performance where boosting the CPU frequencies would have fixed them. Reducing the up_threshold to 45 with MuQSS appears to fix the issue, letting the introduction to "Star Wars: Rogue Leader" run at 100% speed versus about 80% on my test system. Also, let's refactor the definitions and include some indentation to help the reader discern what the scope of all the macros is.
5.4: On the last custom kernel benchmark from Phoronix with Xanmod, Michael configured all the kernels to run using ondemand instead of the kernel's [default selection][1]. This reminded me that another option outside of the kernel's control is the user's choice to change the cpufreq governor, for better or for worse. In Liquorix, performance is the default governor whether you're running acpi-cpufreq or intel-pstate. I expect laptop users to install TLP or LMT to control the power balance on their system, especially when they're plugged in or on battery. However, it's pretty clear to me a lot of people would choose ondemand over performance since it's not obvious it has huge performance ramifications with MuQSS, and ondemand otherwise is "good enough" for most people. Let's codify lower up thresholds for MuQSS to more closely synergize with its aggressive thread migration behavior. This way when ondemand is configured, you get sort of a "performance-lite" type of result but with the power savings you expect when leaving the running system idle.
[1]: https://www.phoronix.com/scan.php?page=article&item=xanmod-2020-kernel
5.14: Although CFS and similar schedulers (BMQ, PDS, and CacULE) reuse a lot more of mainline scheduling and do a good job of pinning single threaded tasks to their respective core, there are still applications that confusingly run steady near 50% and benefit from going full speed or turbo when they need to run (emulators for more recent consoles come to mind). Drop the up threshold for all non-MuQSS schedulers from 80/95 to 55/60.
5.15: Remove MuQSS cpufreq configuration.
Signed-off-by: Kai Krakow <kai@kaishome.de>
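The 80/95 to 55/60 change maps onto the two threshold macros in drivers/cpufreq/cpufreq_ondemand.c; a sketch of the result, likely gated behind the interactive config option like the other tweaks in this series:

#define DEF_FREQUENCY_UP_THRESHOLD	(55)	/* mainline: 80 */
#define MICRO_FREQUENCY_UP_THRESHOLD	(60)	/* mainline: 95 */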
This option is already disabled when CONFIG_PREEMPT_RT is enabled; let's turn it off when CONFIG_ZEN_INTERACTIVE is set as well.
Signed-off-by: Kai Krakow <kai@kaishome.de>
What watermark boosting does is preemptively fire up kswapd to free memory when there hasn't been an allocation failure. It does this by increasing kswapd's high watermark goal and then firing up kswapd. The reason why this causes freezes is because, with the increased high watermark goal, kswapd will steal memory from processes that need it in order to make forward progress. These processes will, in turn, try to allocate memory again, which will cause kswapd to steal necessary pages from those processes again, in a positive feedback loop known as page thrashing. When page thrashing occurs, your system is essentially livelocked until the necessary forward progress can be made to stop processes from trying to continuously allocate memory and trigger kswapd to steal it back.
This problem already occurs with kswapd *without* watermark boosting, but it's usually only encountered on machines with a small amount of memory and/or a slow CPU. Watermark boosting just makes the existing problem worse enough to notice on higher spec'd machines.
Disable watermark boosting by default since it's a total dumpster fire. I can't imagine why anyone would want to explicitly enable it, but the option is there in case someone does.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
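At runtime this is equivalent to vm.watermark_boost_factor = 0; as a kernel default the change is roughly the following in mm/page_alloc.c (sketch):

/* mainline default is 15000; 0 disables boosting entirely while keeping the
 * vm.watermark_boost_factor sysctl available for anyone who wants it back */
int watermark_boost_factor __read_mostly;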
Per an [issue][1] on the chromium project, swap-in readahead causes more jank than not. This might be caused by poor optimization in the swapping code, or the fact that under memory pressure, we're pulling in pages we don't need, causing more swapping. Either way, this is mainline/upstream to Chromium, and ChromeOS developers care a lot about system responsiveness. Let's implement the same change so Zen Kernel users benefit.
[1]: https://bugs.chromium.org/p/chromium/issues/detail?id=263561
Signed-off-by: Kai Krakow <kai@kaishome.de>
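The knob in question is vm.page-cluster, the swap-in readahead order (2^N pages per fault). A sketch of making 0 the default in mm/swap.c's swap_setup(); the actual patch may set the initial value elsewhere:

void __init swap_setup(void)
{
	/* mainline picks 2 or 3 depending on RAM size; 0 means a single page
	 * per swap-in fault, i.e. no readahead */
	page_cluster = 0;

	/* ... rest of swap_setup() unchanged ... */
}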
Partially overrides ZEN interactive adjustments if enabled.
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Link: https://aur.archlinux.org/packages/linux-cachyos-bore
When selecting an idle CPU for a task, always try to prioritize full-idle SMT cores (CPUs belonging to an SMT core where all its siblings are idle) over partially-idle cores.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
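A rough sketch of the selection logic described above; the function name and the idle_cpus mask are hypothetical stand-ins, not the actual scheduler code:

#include <linux/cpumask.h>
#include <linux/topology.h>

/* Prefer a CPU whose whole SMT core is idle; only then settle for a CPU on a
 * partially idle core. 'idle_cpus' is a hypothetical mask of idle CPUs. */
static int pick_idle_cpu(const struct cpumask *allowed, const struct cpumask *idle_cpus)
{
	int cpu;

	for_each_cpu(cpu, allowed)
		if (cpumask_subset(topology_sibling_cpumask(cpu), idle_cpus))
			return cpu;	/* full-idle SMT core */

	for_each_cpu_and(cpu, allowed, idle_cpus)
		return cpu;		/* any idle CPU as fallback */

	return -1;
}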
Signed-off-by: Kai Krakow <kai@kaishome.de>
Add an option to wait on multiple futexes using the old interface, which uses opcode 31 through the futex() syscall. Do that by just translating the old interface to use the new code. This allows old and stable versions of Proton to still use fsync in new kernel releases.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
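A userspace sketch of how old Proton/fsync builds call this legacy interface; the opcode and struct layout follow the original out-of-tree FUTEX_WAIT_MULTIPLE patches and are shown for illustration only:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define FUTEX_WAIT_MULTIPLE 31	/* legacy opcode, translated to the new futex_waitv() code */

struct futex_wait_block {	/* layout from the old out-of-tree patches */
	void *uaddr;
	unsigned int val;
	unsigned int bitset;
};

/* Wait until any of 'count' futexes changes away from its expected value. */
static long futex_wait_multiple(struct futex_wait_block *blocks, unsigned int count,
				const struct timespec *timeout)
{
	return syscall(SYS_futex, blocks, FUTEX_WAIT_MULTIPLE, count, timeout, NULL, 0);
}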
v2: ported from 6.1 to 6.6
v3: ported from 6.12 to 6.18
Signed-off-by: Kai Krakow <kai@kaishome.de>
v2: ported from 6.1 to 6.6
Signed-off-by: Kai Krakow <kai@kaishome.de>
Added note about performance characteristics to the top post:
Export patch series: https://github.com/kakra/linux/pull/44.patch
Gaming and Desktop Interactivity Patches
This is a combined patchset from various sources, including some ports of legacy Proton features (deprecated). The patchset mostly focuses on memory management and interactivity, optimizing the system for responsiveness rather than maximum throughput, to reduce stutters, memory thrashing and frame drops. Most unrelated ZEN patches have been dropped.
Changes since 6.12 LTS
- ntsync (add a udev rule ACTION=="add|change", KERNEL=="ntsync", SUBSYSTEM=="misc", MODE="0666" to set proper permissions)
Included Patches
Recommendation
Kernel 6.18 LTS seems to have worse performance characteristics for games compared to kernel 6.12; especially frame pacing can jitter a lot more, resulting in reduced perceived smoothness despite average FPS being identical or better. I found that I can fix it by building the kernel with support for sched_ext and then running with the LAVD scheduler, which also yields lower power consumption when idle (sys-app/kernel in Gentoo).
Here's a talk on the technical details of the LAVD scheduler, which was initially made for the Steam Deck: https://www.youtube.com/watch?v=5F-vQgv4sI0
Deprecated