Ryzen, Part 3

Take 3

Of course neither the new memory nor the BIOS update fixed the random crashes, but fortunately I’ve got another hint: Maybe it’s a Linux-specific problem, and not one with memory timing. This post suggests that it’s a problem with C-States and RCU. Adding the following kernel command line parameters should help:

processor.max_cstate=1
rcu_nocbs=0-11
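
To double-check that both parameters really made it onto the command line of the running kernel, something like the following sketch might help. The function name missing_params is made up for this example, and it only compares strings, it does not touch the bootloader. On a GRUB setup the parameters would typically go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, but that depends on the bootloader in use.

```shell
# Print every wanted parameter that is missing from a kernel command line.
# $1: the command line string (normally the content of /proc/cmdline)
# remaining arguments: the parameters that should be present
# (missing_params is a hypothetical helper name, not an existing tool)
missing_params() {
    cmdline=" $1 "
    shift
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) ;;                # present as a whole word
            *) printf '%s\n' "$param" ;;    # not on the command line
        esac
    done
}

# Usage on the running system (prints nothing if both are set):
# missing_params "$(cat /proc/cmdline)" processor.max_cstate=1 rcu_nocbs=0-11
```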

As always, nothing is as simple as it seems 🙁

The C-States

For max_cstate to take effect, I had to actually enable Global C-State-Control in the UEFI-thingy of my mobo! The default was Auto, which in turn defaulted to Disabled. After enabling it, dmesg reported this:

ACPI: ACPI: processor limited to max C-state 1

Before that, there was no mention of C-states in the kernel log, so I doubt that it has any impact, but one should never give up hope!
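
Instead of scrolling through the kernel log by hand, the check can be scripted. This is only a sketch: log_mentions is a made-up helper that reads a log from stdin and looks for a fixed string, and the exact wording of the message may differ between kernel versions.

```shell
# Succeed if the kernel log read from stdin contains the fixed string $1.
# (log_mentions is a hypothetical helper name; -F disables regex matching)
log_mentions() {
    grep -qF -- "$1"
}

# Usage on the running system (dmesg may need root):
# dmesg | log_mentions 'limited to max C-state' && echo 'C-state limit is active'
```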

The RCU-Thingy

What it is (quoting Paul E. McKenney from LKML):

Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU achieves scalability improvements by allowing reads to occur concurrently with updates.

It’s much more likely to be the cause of the problem since I once saw something like this after a reboot on the console:

NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DOM Worker:1364]

The process was “systemd” instead of “DOM Worker”, but that shouldn’t matter. Anyway, the parameter rcu_nocbs=0-11 has no effect if the kernel config option CONFIG_RCU_NOCB_CPU is not set. Guess what, in the stock Arch Linux kernel it’s unset, lucky me!
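
Whether the option is set can be read from the kernel configuration. The sketch below assumes the running kernel exposes its configuration at /proc/config.gz (that needs CONFIG_IKCONFIG_PROC, so it may not exist everywhere); config_state is a made-up helper name and only handles boolean options like this one.

```shell
# Print "enabled" or "not set" for the kernel config option named in $1,
# reading the kernel configuration from stdin.
# (config_state is a hypothetical helper name; it only looks for OPTION=y)
config_state() {
    awk -v opt="$1" '
        BEGIN { state = "not set" }
        $0 == (opt "=y") { state = "enabled" }
        END { print state }
    '
}

# Usage on the running system, assuming /proc/config.gz exists:
# zcat /proc/config.gz | config_state CONFIG_RCU_NOCB_CPU
```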

So I ventured out to compile a custom kernel for Arch. That turned out to be disappointingly easy! The available documentation just works ™! Four hints, though:

  1. Uncomment “make menuconfig”
  2. Edit /etc/makepkg.conf and set MAKEFLAGS to “-j<no-processors+1>” to get parallel builds
  3. If you have an NVIDIA graphics card and use the proprietary driver, keep in mind that you have to rebuild that one, too. Once again, that was ridiculously easy. Just install nvidia-dkms before updating the kernel with pacman -U and it will be built automagically when you install the new kernel! Detailed instructions are here.
  4. Don’t forget to update your grub-config with grub-mkconfig before rebooting!

After a reboot, this line should show up in dmesg, otherwise it didn’t work:

Offload RCU callbacks from CPUs: 0-11.

I’m waiting with bated breath to see how it turns out!
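
Regarding hint 2 in the list above: the MAKEFLAGS value can be derived from the machine instead of being typed by hand. A minimal sketch, where makeflags_for is a made-up helper name and nproc from coreutils is assumed to be available:

```shell
# Build the MAKEFLAGS value from a processor count ($1):
# one job more than there are processors, as suggested in hint 2.
# (makeflags_for is a hypothetical helper name)
makeflags_for() {
    printf '%s\n' "-j$(( $1 + 1 ))"
}

# Print the line to put into /etc/makepkg.conf for this machine:
# printf 'MAKEFLAGS="%s"\n' "$(makeflags_for "$(nproc)")"
```

On a box with 12 logical CPUs (0-11) this yields -j13.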

Previous parts of my adventure: Part 1, Part 2.

Backup with Borg

Yes, resistance is futile! Borg is a descendant of attic which is a descendant of obnam. Development of the latter has ceased, so I needed a new victim. borg is a compressing, deduplicating, file-based backup program which does more or less the same as obnam.

Despite its lineage, borg is a bit different. It kinda does not work via ssh, only by FUSE and an sshfs-usermount, which is quite a bad choice, as I learned. It’s much faster if you run borg on the source, i.e. “where the data is”, and transfer the result to the backup repository. Thinking about it, that’s not much of a surprise.

If you use sshfs, the destination has to pull all data for deduplication and compression, so no bandwidth is saved. If you run a borg server and only send the result of deduplication and compression over the wire, it’s much faster, what a surprise! So, always run borg at the source!
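
Running borg at the source against a repository on the backup server could look roughly like the sketch below. It assumes borg is installed on both machines; the helper borg_target, the host backup@nas and the path /srv/borg/repo are made up for this example and are not my actual setup.

```shell
# Build the "repository::archive" argument for a repository reached via ssh.
# $1: user@host of the backup server, $2: absolute repository path there,
# $3: archive name (borg_target is a hypothetical helper name)
borg_target() {
    printf 'ssh://%s%s::%s\n' "$1" "$2" "$3"
}

# Run on the machine that holds the data; only deduplicated, compressed
# chunks travel over the wire (host and paths are placeholders):
# borg create --stats --compression lz4 \
#     "$(borg_target backup@nas /srv/borg/repo "$(hostname)-$(date +%F)")" /home
```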

 

Ryzen, Part 2

Introduction

Since the arrival of my new Ryzen system I’ve had problems with random crashes. It seems to be a memory timing problem, at least that’s what Google suggests. I bought a 16GB Corsair Vengeance memory kit, consisting of 2x8GB RAM bars, a suggestion from a colleague. She’s using the same ASUS board with that memory without problems, but her box isn’t running 24/7. It’s a Win10 Gaming PC.

Take 1

First thing to do: Update the BIOS, or UEFI, as it’s called nowadays. Of course that didn’t help. Also, the changelog wasn’t very helpful (“improving stability”), thanks, ASUS!

So I ventured out and did some research on overclocking and memory timing. Turns out that Intel invented XMP (Extreme Memory Profile). Basically, it’s timing data stored on the memory bar, read and used by the BIOS/UEFI. AMD didn’t want to pay the license fee, so they called their version AMP. ASUS, the mainboard manufacturer, called its implementation of AMP D.O.C.P. (Direct Over Clock Profile).

Full of hope, I turned on D.O.C.P. It set the timing data to the suggested values from this document. It didn’t help. Guess when I bought the new gear:

The crashes have absolutely nothing to do with CPU load or temperature, quite the opposite. Mostly, they happen during the night, after backup 🙁

Take 2

“What the hell”, I figured, and bought another memory kit. This time 16GB G.Skill Flare X for only € 197,46. After that the box ran for a whopping 5 days without a crash, yay! It still crashed, though. This time I didn’t have X running and saw a kernel message that PID 1 (systemd) was stuck: “Soft lockup: CPU#3 stuck for 23s” (or similar). As always, I could only recover the box by hardware reset.

Take 2.5

So today (2017/11/24) I updated to the latest BIOS/UEFI (PRIME-X370-PRO-ASUS-3203.CAP) and down-clocked the memory to 2199 or 2133MHz, don’t remember the exact number. Let’s see how that turns out!

[To be (dis)-continued…]

Upgrading check-mk and Debian

1. Overview

Upgrading Debian from 8 (jessie) to 9 (stretch) with check-mk installed isn’t as easy as it seems. You have to:

  1. Upgrade check-mk to 1.4.0 and fix all issues
  2. Backup all sites
  3. Purge check-mk
  4. Upgrade Debian
  5. Reinstall check-mk 1.4.0
  6. Restore the check-mk-sites from backup

2. Upgrade check-mk and back it up

Download the .deb package and install it. Follow the official guide and upgrade all sites. After fixing all issues, create a backup of each site:

# su - <sitename>
$ omd backup site.name.tar.gz

Repeat this for all sites.
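
With more than a handful of sites, the repetition can be scripted. The following is a sketch under assumptions: that the installed OMD version supports "omd sites --bare" to list site names and the root form "omd backup <site> <archive>"; the helper backup_commands and the directory /root/omd-backup are made up. It only prints the commands, so they can be reviewed before anything runs.

```shell
# Print one "omd backup" command per site name read from stdin.
# $1: directory that should receive the archives
# (backup_commands is a hypothetical helper name; empty lines are skipped)
backup_commands() {
    dest=$1
    while IFS= read -r site; do
        [ -n "$site" ] || continue
        printf 'omd backup %s %s/%s.tar.gz\n' "$site" "$dest" "$site"
    done
}

# As root: review the generated commands first, then pipe them to sh
# (assumes "omd sites --bare" prints one site name per line):
# omd sites --bare | backup_commands /root/omd-backup
# omd sites --bare | backup_commands /root/omd-backup | sh
```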

3. Remove check-mk and upgrade to stretch

Stop sites:

# su - <sitename>
$ omd stop

Repeat this for all sites. Then remove (purge) check-mk:

# dpkg -P check-mk-raw-1.4.0p17

Once this is done, upgrade the distribution to Debian 9 (stretch) (you really should know how to do that!). Autoremove all obsolete packages and reboot.

4. Reinstall check-mk and restore from backup

Download the .deb package for stretch and install it. Since you autoremoved dependent packages earlier, the install will most likely fail. Fix it with:

# apt --fix-broken install

Now we can restore the sites from our backup (as root!):

# omd restore <site-name.tar.gz>
# su - <sitename>
$ omd start

Repeat for all sites and fix all remaining issues.

5. Notes

Of course you don’t want to do this without a safety net. Take a snapshot and destroy that instead of the real VM. How to do that with KVM and libvirt is explained here.
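
As a rough idea of what such a safety net could look like with virsh, here is a sketch. It is not necessarily what the linked instructions do; the helper names, the domain name monitoring and the snapshot name pre-stretch are made up. With DRY_RUN left at its default of 1 the commands are only printed.

```shell
# Echo the command instead of running it while DRY_RUN=1 (the default here).
# (run, snapshot_vm and rollback_vm are hypothetical helper names)
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# $1: libvirt domain name, $2: snapshot name
snapshot_vm() {
    run virsh snapshot-create-as "$1" "$2"
}

# Go back to the snapshot if the upgrade went wrong.
rollback_vm() {
    run virsh snapshot-revert "$1" "$2"
}

# Usage (placeholder names; set DRY_RUN=0 to actually execute):
# snapshot_vm monitoring pre-stretch
# rollback_vm monitoring pre-stretch
```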