Ryzen, Part 5

Well, well, well: after 71 days of uptime the system can be considered stable.

Disabling the C6 sleep state did it:

zenstates.py --c6-disable

Since the last reboot, the Spectre/Meltdown disaster happened. DKMS stopped working because Arch Linux updated gcc to 7.3 with retpoline support, so it flatly refused to build the nvidia module with a compiler different from the one the kernel had been built with. That wasn’t really a problem, because X still restarted properly.

But there were other problems, like systemd asking for passwords when it shouldn’t, dbus being out of date, and so on. Well, that’s what you get for running a rolling-release distribution. So I thought it was time to schedule a kernel re-compile and a reboot.

The re-compile was harder than expected. In short:

  • When following the official documentation, do the checkout in an empty directory
  • Edit prepare() to run make oldconfig and then make menuconfig, in that order
  • Don’t forget to uncomment and change pkgbase to linux-ryzen!
  • Don’t run makepkg -s on an encrypted volume 🙂
  • If DKMS complains about a compiler mismatch on pacman -U, do IGNORE_CC_MISMATCH=1 pacman -U …
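
The compiler mismatch from the last bullet can be spotted before pacman gets involved. Here is a minimal Python sketch of such a check. It is not what DKMS or the nvidia build does internally; it assumes the kernel's build compiler can be read from /proc/version, that gcc understands -dumpfullversion (GCC 7 or later), and that comparing major.minor is good enough. The function names are made up for illustration.

```python
import re
import shutil
import subprocess
from pathlib import Path


def kernel_gcc_version(proc_version: str):
    """Extract the GCC version recorded in a /proc/version string, or None."""
    # Assumed shapes: "gcc version 7.2.1 ..." (older) or "gcc (GCC) 12.2.0" (newer).
    match = re.search(r"gcc(?: version| \([^)]*\)) (\d+(?:\.\d+)*)", proc_version)
    return match.group(1) if match else None


def same_compiler(kernel_gcc: str, installed_gcc: str, parts: int = 2) -> bool:
    """Compare only the first `parts` version components (major.minor by default)."""
    return kernel_gcc.split(".")[:parts] == installed_gcc.split(".")[:parts]


def main() -> None:
    proc = Path("/proc/version")
    gcc = shutil.which("gcc")
    if not proc.exists() or gcc is None:
        print("cannot check: /proc/version or gcc is not available")
        return
    built_with = kernel_gcc_version(proc.read_text())
    # -dumpfullversion prints e.g. "7.3.0"; older compilers print nothing useful.
    installed = subprocess.run(
        [gcc, "-dumpfullversion"], capture_output=True, text=True
    ).stdout.strip()
    if built_with is None or not installed:
        print("could not determine one of the compiler versions")
    elif same_compiler(built_with, installed):
        print(f"kernel gcc {built_with} matches installed gcc {installed}")
    else:
        print(f"mismatch: kernel built with gcc {built_with}, installed gcc is {installed}")


if __name__ == "__main__":
    main()
```

If it reports a mismatch, the choice is the same as above: rebuild the kernel with the current compiler, or override the check with IGNORE_CC_MISMATCH=1.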

After a successful reboot I decided to install the idle 16 GB of RAM I had initially purchased, since RAM timings weren’t really the problem. Now I have a workstation with a whopping 32 GB of RAM:

# free -m 
       total        used        free...
Mem:   32167        5521        8478...
Swap:  16382           0       16382

I kept the RCU_* setting, so let’s see how it turns out. Keeping fingers crossed!

The whole SheBang!

Ryzen, Part 2

Introduction

Since the arrival of my new Ryzen system I have had problems with random crashes. It seems to be a memory timing problem, at least that’s what Google suggests. On a colleague’s suggestion I bought a 16GB Corsair Vengeance memory kit, consisting of two 8GB modules. She’s using the same ASUS board with that memory without problems, but her box isn’t running 24/7. It’s a Win10 gaming PC.

Take 1

First thing to do: update the BIOS, or UEFI, as it’s called nowadays. Of course that didn’t help. Also, the changelog wasn’t very helpful (“improving stability”), thanks, ASUS!

So I ventured out and did some research on overclocking and memory timing. Turns out that Intel invented XMP (Extreme Memory Profile). Basically, it’s timing data stored on the memory module, which the BIOS/UEFI reads and applies. AMD didn’t want to pay the license fee, so they called their version AMP. ASUS, the mainboard manufacturer, called its implementation of AMP D.O.C.P. (Direct Over Clock Profile).

Full of hope, I turned on D.O.C.P. It set the timing data to the suggested values from this document. It didn’t help. Guess when I bought the new gear:

The crashes have absolutely nothing to do with CPU load or temperature, quite the opposite. Mostly, they happen during the night, after the backup 🙁

Take 2

“What the hell”, I figured, and bought another memory kit. This time 16GB G.Skill Flare X for only € 197,46. After that the box ran for a whopping 5 days without a crash, yay! It still crashed, though. This time I didn’t have X running and saw a kernel message that PID 1 (systemd) was stuck: “Soft lockup: CPU#3 stuck for 23s” (or similar). As always, I could only recover the box with a hardware reset.
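
Since the message scrolled by on the console and I only remember it roughly, it would help to pull such lines out of a saved kernel log afterwards. Below is a small Python sketch for that. The exact message shape in the regex ("soft lockup - CPU#3 stuck for 23s! [systemd:1]") is an assumption about how the kernel watchdog words it, and find_lockups is a made-up helper name.

```python
import re
import sys
from collections import Counter
from pathlib import Path

# Assumed message shape: "... soft lockup - CPU#3 stuck for 23s! [systemd:1]"
LOCKUP = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s!(?: \[([^\]]*):(\d+)\])?"
)


def find_lockups(log_text: str):
    """Return a (cpu, seconds, task, pid) tuple for each soft-lockup line found."""
    hits = []
    for match in LOCKUP.finditer(log_text):
        cpu, seconds, task, pid = match.groups()
        hits.append((int(cpu), int(seconds), task, int(pid) if pid else None))
    return hits


if __name__ == "__main__":
    # Each command-line argument is a saved log file; with no arguments, nothing happens.
    for name in sys.argv[1:]:
        hits = find_lockups(Path(name).read_text(errors="replace"))
        per_cpu = Counter(cpu for cpu, _, _, _ in hits)
        print(f"{name}: {len(hits)} soft lockup(s), per CPU: {dict(per_cpu)}")
        for cpu, seconds, task, pid in hits:
            print(f"  CPU#{cpu} stuck for {seconds}s in {task} (pid {pid})")
```

With a persistent journal, the previous boot's kernel messages can be saved with journalctl -k -b -1 > last-boot.log and the file name passed to the script. If the box locks up before the message reaches the disk, there is nothing to find, of course.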

Take 2.5

So today (2017/11/24) I updated to the latest BIOS/UEFI (PRIME-X370-PRO-ASUS-3203.CAP) and down-clocked the memory to 2199 or 2133 MHz; I don’t remember the exact number. Let’s see how that turns out!

[To be (dis)-continued…]