Sunday, January 21, 2024

“Our heat pump water heater isn’t heating!”

Dear Dad,

Last night, Carol called to me from the bathroom: “Our heat pump water heater isn’t heating!” Yep, we had one of those installed. You know, those things that run like a refrigerator in reverse? I still remember your explaining to me how the freon or whatever gets compressed and heated, then sprays into the part of the system inside the fridge, cooling everything. I also remember your telling me that I didn’t know how lucky I was, and you were right—I didn’t. But what I'm thinking of now is how lucky I was to have you as my father.

Anyway, I ran out to the patio, where the heat-pump water heater is, and opened the door to its closet. It was dark, where usually an LCD display shows what it's set to. H’m. I went to the subpanel in the house. The breaker had been tripped; I wondered why. I switched it to ON and immediately heard arcing. Oh no! I quickly turned it off. Why…?

The cover had one screw mostly off? Oh, the sheetrock guy had to remove the cover in order to put the wallboard on. There was joint compound around the edges, too. Right, the bathroom project still isn’t finished. Yes, we’re using a contractor. Unlike you, I don’t do all this stuff myself any more.

I looked closely at the breaker—was it a little cockeyed? It would be easy to accidentally jostle it while removing or replacing the cover.

I pulled out the (30A 240V) pair of breakers, and the one breaker’s “jaws” looked a little too wide; that was somehow unsurprising; it was the one grabbing the, uh, busbar with the dark deposits (from the arcing, I reckon).

No bueno. It was already late, was I going to have to run out and find…? Wait—didn’t I have a spare 30A 240V pair of breakers? Last summer? (fall?), when I thought our old oven might have been the victim of a flakey circuit breaker, I spent the $20 on a replacement pair, which I never installed. Good thing, too; the old breakers were just fine, as proven by the new oven’s flawless operation from day one.

And an even better thing: I had a brand-new pair of breakers to use on the heat pump! I made my way to the garage, found the breakers where I’d left them, and examined them to make sure I correctly remembered their rating. 30A—yes! The jaws had equal (to my eye) and narrow widths, and each pair of jaws also had a little bit of, ah, conductive toothpaste—at least that’s what it looked like—to promote solid contact with its busbar.

Now all I had to do was get the wires off the old breakers and onto the new ones. Wow, why are these screws so hard to turn? Was it because I was using a common screwdriver when I should have been using a square-drive? Modern technology! Fortunately, I had an S2 bit for the cordless screwdriver, which I bought just a couple weeks ago for another purpose. Out in the garage, I found the package exactly where I’d left it (wow, I should buy a lottery ticket). I pulled one off the card (I’d bought a pair) and grabbed my multi-driver tool with the appropriate hexagonal hole (a freebie from when I worked at hp over 20 years ago).

Screws sure turn more easily when you have the right blade. Got the wires off the old breakers and onto the new ones. I might have liked to clean the black deposits off the busbar, but nah, I didn’t want to try figuring out what to clean it with (something not made of metal) and besides, what harm would that stuff do? The conductive toothpaste would ensure a good bond.

Engaged the outer edge of the breakers, then pressed the inner edge all the way in. Turned the breaker to “ON.” Outside, I was greeted by glowing digits: 121°. I headed back in to button up.

I didn’t fully tighten the cover screws, since the joint compound wasn’t dry everywhere. Then I texted my general contractor, asking him to please tell the sheetrock guy that I had to replace the breaker, and that's why I had to touch that breaker panel. He’d certainly be able to tell that I touched it, and I wanted him to know why.

Dad, I’m so glad you taught me all the stuff you did. I truly am a lucky man. I just wish I could still pick up the phone and tell you about this little adventure. You’d commiserate with me and laugh (“Are you saying your wife called to you from the shower, and she just wanted you to fix the hot water?”). You’d agree it was a lucky thing I had not gotten around to returning the unused circuit breakers last summer or fall. You’d congratulate me on the quick diagnosis. I sure would have enjoyed all that. But mostly I would have just enjoyed telling you about it, knowing you understood my thought process.

Love you and miss you, Dad.

Tuesday, January 16, 2024

Suddenly my automounts don't work... and a hack fix

The server, p64, is debian 11 (bullseye); the client is mac os x (darwin kernel version 23.1.0 Mon Oct 9 21:27:27 PDT 2023). Automounts used to work but somehow stopped, I'm not sure when. I have homedirs on Linux, as /home; I want to read/write the Linux /home as, uh, /home.

After searching frantically, here is something that kinda works. First, on linux:

  • sudo systemctl start rpc-statd
  • sudo service rpcbind start
Then, on mac os, forget about automounting, just do it manually. As root:
mount -o resvport p64:/home /home
and it all works.
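
The Mac-side step can be wrapped in a small shell function so the exact options don't have to be remembered next time. This is only a sketch: the name manual_nfs_mount and its print/run switch are made up for illustration, and by default it just prints the command so it can be eyeballed before running it as root.

```sh
# Hypothetical wrapper around the manual mount shown above.
# Prints the command by default; pass "run" as the 4th argument to execute it
# (which needs root). Assumes none of the arguments contain spaces.
manual_nfs_mount() {
    server=$1
    export_dir=$2
    mountpoint=$3
    mode=${4:-print}
    cmd="mount -o resvport $server:$export_dir $mountpoint"
    if [ "$mode" = run ]; then
        $cmd
    else
        printf '%s\n' "$cmd"
    fi
}

manual_nfs_mount p64 /home /home    # prints: mount -o resvport p64:/home /home
```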

I want to figure out the "real" automounter problem, but right now I just want to do what I sat down to do.

Monday, January 30, 2023

Can't pull Kenmore 790.47892602 builtin oven from its cabinet?

Like the person in this doityourself.com post, I had a problem where the oven stopped heating after being put through a cleaning cycle.

If an ounce of prevention is worth a pound of cure, the prevention is: Never let it clean for more than TWO hours. The default 3-hour cleaning cycle always engages the thermal safety device; the oven won't heat at all until you reset it. Which is a real pain.

OK, for the cure. The overall goal is to reset the safety switch, which in our oven, looks like the photo at left; I reset it by pushing on the red button. The body of the switch is about an inch in diameter (your switch might look different). You get at the switch by pulling out the oven (you will need a stepladder or something to hold it up so it doesn't fall on some body part) and removing a big piece of sheet metal.

How do you pull the oven out? The book says (figure 7, page 6) to use a certain tool (which I don't have or can't find) to release the mounting bracket; see the diagram at right. Since I don't have the tool, I inserted a common screwdriver with a 4-inch blade, ⅛” wide. Insert into one hole, keeping the blade horizontal, and pull that side of the oven out ½–1”. Pull the screwdriver out and repeat on the other side. Where exactly is the hole? In the photo below on the left, my index finger shows where to insert the screwdriver; the photo on the right (or maybe below it) shows what that looks like with the oven removed.

The photo at left shows where the mounting bracket (above) engaged with the oven body; that's what keeps the oven from falling out when you open the door (I mean, even before you try to pull it out).

BEFORE YOU PULL THE OVEN OUT MORE THAN AN INCH OR SO, get a step-ladder or some other piece of furniture sturdy and stable enough to support the oven. There is danger of severe personal injury here. The book says the cabinet must be able to support 200 pounds. Avoid a trip to the emergency room and a lot of awkward explanations! I had both a step-ladder and a sturdy wooden patio-chair, to support two corners of the oven.

Once you get the oven pulled out, you remove the sheet-metal panel on the back. There are maybe 6–7 screws that hold it on: one on top, one on the bottom, and two or three on each side. I think I lost one of the screws on an earlier operation. Two of the screws hold on little black, uh, feet, maybe 3mm thick and maybe 1cm in diameter. I don't know how important they are, but they are there.

Once you remove those screws and stow the back cover (hint: with a black magic marker, write on the inside of the cover: "INSIDE"), you'll be able to reach the thermal switch, highlighted in the photo at right in magenta.

Installation is the reverse of removal.

BE SAFE! Remove the supports only after you get the oven pushed in far enough for the mounting brackets to engage the chassis (i.e. far enough that you can't pull the oven out).

Saturday, January 14, 2023

assimilating a new (to us) imac: SMTP mail

So the mac mini is almost a teenager so we got a new box. According to "About this Mac" it's "iMac (retina 5K..., 2019)" running macOS Monterey 12.0.1. I need to get dovecot on it, among other things.

<…time passes…>

OK that was... October maybe? I moved "all" the files with either scp or rsync, upgraded to Ventura, installed crashplan for small business and told it to back up Carol's files (and to stop backing up the old mac mini). Carol's been using the new machine to good effect for a few months now, but I'm still using the mini to fetch SMTP mail from my ISP. The setup is byzantine, and in case I'm still using that email when we replace the 2019 iMac, I'll record how it handles smtp mail for my future self.

Fetching the mail here

The mini runs a "service"... I thought we could "fetchmail -d 60" but how to send password encrypted? It would certainly be bad medicine authenticating in cleartext!

The solution involves an ssh tunnel and a macos "service." Apparently if you put an XML file named something.plist in /System/Library/LaunchDaemons/ then macos will run it as root on startup. Mine looked like this:

unknownc42c0321f10e:~ admin$ cat /System/Library/LaunchDaemons/collin.admin.tunnel.plist         
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
		<string>collin.admin.tunnel</string>
	<key>Program</key>
		<string>/Users/admin/tunnel.sh</string>
	<key>RunAtLoad</key> 
		<true/>
</dict>
</plist>
unknownc42c0321f10e:~ admin$ 
OK actually on the mini the username was "postman"; on the imac it'll be "admin" so I'm changing it here.

OOPS... on the iMac, running Ventura, we can't touch /System/Library/LaunchDaemons; instead I had to add the above as /Library/LaunchDaemons/collin.admin.tunnel.plist; I hope it works.

What does /Users/admin/tunnel.sh do? It establishes a tunnel to my ISP, making localhost port 60110 tunnel to the POP server's port 110 for about a minute or two. Then it runs fetchmail. Like this:

#!/bin/bash
ID=$(id -u)
if [[ $ID == 0 ]] ; then
    echo /Users/admin/tunnel.sh | su - admin
    exit 0
fi
PATH=$PATH:/usr/sbin:/opt/local/bin:/usr/bin:/bin
while :; do
        if netstat -an -finet|grep LISTEN | grep 60110 > /dev/null; then
                : be happy
        else
                ssh -f sonic -L 60110:pop.sonic.net:110 sleep 120 >/dev/null &
        fi
        sleep 30                # That should be long enough to open socket
        fetchmail --sslproto "" >> tmp/fetchmail.log 2>&1 &
        FPID=$!
        sleep 120
        kill $FPID
        sleep 10
done
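
The netstat-and-grep test in that loop matches any line that happens to contain 60110 anywhere. A slightly stricter check, sketched here as a hypothetical helper called port_listening, looks only at LISTEN lines and only at the local-address column. It assumes the local address is the fourth field of "netstat -an" output, which as far as I know holds both on macOS (127.0.0.1.60110) and on Linux (127.0.0.1:60110).

```sh
# Hypothetical helper: read "netstat -an" output on stdin and succeed only if
# some LISTEN line has a local address ending in .PORT or :PORT.
port_listening() {
    awk -v p="$1" '
        /LISTEN/ && $4 ~ ("[.:]" p "$") { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

In the loop it would be used as "if netstat -an -finet | port_listening 60110; then ...".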
I had a little surprise with the .fetchmailrc: I can't say
poll localhost proto pop3 port 60110 user ISP-username pass ISP-password is admin here fetchall mda "/usr/bin/sendmail -i -f %F %T"
because fetchmail won't let me fetch from localhost. So I have a hack in /etc/hosts:
127.0.0.1       localhost see.admin.fetchmailrc.invalid
and now fetchmail can, well, fetch the mail.

AND ANOTHER THING... I never used to have to say “--sslproto ""” but it now seems necessary lest I get some SSL error.

Once the mail gets here

… sendmail (or maybe postfix) will try to deliver it, probably to /var/spool/mail/WHATEVER. But we don't want that, so we have to supply a .forward file:
admin@Admins-iMac-2 ~ % cat .forward                                                           
"|/opt/local/bin/procmail"
admin@Admins-iMac-2 ~ %                                                                        
And a .procmailrc, which tries to figure out who the email is addressed to. If there's a header that says
To: collin@<ourdomain>
then that's easy; it's addressed to me.

But what if there's no header like that? What if I'm bcc:-ed? Basically we look for a useful Received:  header. Anyway, the point is, admin's .procmailrc file tries to figure out who the email is for, and then it sends the email to Carol or to me, or to the bit-bucket. It sends the email to us by running /usr/sbin/sendmail, so if I want email processed by procmail, I again have to have a $HOME/.forward, just like "admin" did. And my own $HOME/.procmailrc.
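
The "who is this for?" logic can be sketched outside procmail as a small shell function. This is not the actual .procmailrc, just an illustration of the idea: look at To: first, then fall back to a "for <address>" clause in a Received: header. The function name guess_recipient and the domain example.org are placeholders.

```sh
# Placeholder domain; the real one is not shown in the post.
DOMAIN=example.org

# Read one message on stdin and print the local part of the first address
# at $DOMAIN found in To:, else in a Received: "for <...>" clause,
# else "unknown". Dots in DOMAIN act as regex wildcards in this sketch.
guess_recipient() {
    awk -v dom="$DOMAIN" '
        BEGIN { pat = "[A-Za-z0-9._-]+@" dom }
        /^$/ { exit }                          # a blank line ends the headers
        /^[^ \t]/ {                            # new header; continuation
            hdr = tolower($0)                  # lines keep the previous name
            sub(/:.*/, "", hdr)
        }
        hdr == "to" && to == "" && match($0, pat) {
            to = substr($0, RSTART, RLENGTH)
        }
        hdr == "received" && rcv == "" && /for </ && match($0, pat) {
            rcv = substr($0, RSTART, RLENGTH)
        }
        END {
            who = (to != "") ? to : rcv
            if (who == "") who = "unknown"
            sub(/@.*/, "", who)
            print who
        }
    '
}
```

A message that comes back as "unknown" is the one that would go to the bit-bucket.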

Other stuff

I have to run dovecot on the iMac, but only for Carol's email. She hasn't looked at it for months now, so when she decides to have a look, I'll probably have to figure out how to run dovecot on it.

As for me, I'll nfs-mount $HOME/Maildir from the iMac onto my linux box, which is where I read non-web email. The iMac wasn't exporting any filesystems when we brought it home, so I just did what came naturally: copy /etc/exports from the teen-aged mac mini:

admin@Admins-iMac-2 ~ % cat /etc/exports                                                       
/Users  -network 192.168.1.0 -mask 255.255.255.0
I'll mount that and symlink Maildir there to $HOME/Maildir on the Linux box.

Then I think I should remove /System/Library/LaunchDaemons/collin.postman.tunnel.plist from the mac mini... oh, wait, no, I don't have to do that; I can just make the script do nothing I think.

Then rsync to make the iMac's copy of $HOME/Maildir match the mac mini's copy... for both Carol and me

Admins-iMac-2:~ carol$ time rsync -av 192.168.1.131:Maildir ./
receiving file list ... done
Maildir/
Maildir/log
Maildir/msgid.cache
Maildir/new/
Maildir/new/1673726769.51227_2.unknownc42c0321f10e.attlocal.net
Maildir/new/1673737810.52745_2.unknownc42c0321f10e.attlocal.net
Maildir/new/1673746452.53955_2.unknownc42c0321f10e.attlocal.net
Maildir/tmp/

sent 66405 bytes  received 520272 bytes  7287.91 bytes/sec
total size is 675278927  speedup is 1151.02

real	1m20.414s
user	0m0.292s
sys	0m0.200s
Admins-iMac-2:~ carol$ 
Mine will take rather longer I think...

Then install

Monday, July 25, 2022

upgrade debian stretch → buster → bullseye

I hate upgrades, but python3 on my debian stretch box doesn't grok f-strings, because its python3 is python3.5; f-strings were added in python3.6. It’s been a couple years since I upgraded to stretch (debian9), and security updates have just been discontinued, so I thought, why not skip buster (debian10) and just upgrade to debian11 (bullseye)? Then maybe I could wait four years rather than two before having to do it again.

Of course I didn't see any instructions for a 2-release upgrade, so the first thing was to upgrade stretch to buster. I basically followed the instructions in this article. As root:

  • change “stretch” to “buster” in /etc/apt/sources.list
    Casting all caution to the wind, I skipped the part about making a backup
  • apt-get update; apt-get upgrade; apt-get dist-upgrade
  • reboot
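
The first bullet, plus the backup that got skipped, can be sketched as a few lines of shell. The helper name switch_release is invented for this example, and the demonstration runs on a scratch file rather than the real /etc/apt/sources.list.

```sh
# Hypothetical helper: rewrite a release name in a sources.list-style file,
# keeping the original as FILE.OLDNAME (e.g. sources.list.stretch).
switch_release() {
    file=$1
    old=$2
    new=$3
    cp -p "$file" "$file.$old" || return 1
    sed "s/$old/$new/g" "$file.$old" > "$file"
}

# Try it on a scratch copy first.
scratch=$(mktemp)
printf 'deb http://deb.debian.org/debian stretch main\n' > "$scratch"
switch_release "$scratch" stretch buster
cat "$scratch"      # deb http://deb.debian.org/debian buster main
```

One thing a plain substitution will not handle on the second hop: in Debian 11 the security suite is named bullseye-security rather than bullseye/updates, so that line needs a manual edit.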
That’s all I did. There was one surprise: thunderbird complained about not being able to connect to “mini1” but I had no idea why. I did, however, try to ssh there from my now-buster desktop; passwordless ssh failed. “ssh -v” showed me that I had the wrong kind of keys now, so I regenerated keys on mini1 (a mac mini), added the contents of .ssh/id_mini1.pub to mini1’s .ssh/authorized_keys, and copied the private key into .ssh/id_mini1. Things started looking better. But thunderbird still said it couldn't connect to mini1. Why was that? The server settings said it was connecting to 127.0.0.1:143; was I running dovecot locally? I had to be, right? Yes, according to /etc/dovecot/dovecot.conf:
# A comma separated list of IPs or hosts where to listen in for connections.
# "*" listens in all IPv4 interfaces, "::" listens in all IPv6 interfaces.
# If you want to specify non-default ports or anything more complex,
# edit conf.d/master.conf.
#listen = *, ::
listen = 127.0.0.2
So when I tried "sudo dovecot", it said the ssl key couldn’t be found, and even gave me a pathname. So I commented it out in /etc/dovecot/conf.d/10-ssl.conf:
##
## SSL settings
##

# SSL/TLS support: yes, no, required.
ssl = no                                           ←was “yes”

# PEM encoded X.509 SSL/TLS certificate and private key. They're opened before
# dropping root privileges, so keep the key file unreadable by anyone but
# root. Included doc/mkcert.sh can be used to easily generate self-signed
# certificate, just make sure to update the domains in dovecot-openssl.cnf
ssl_cert = </etc/dovecot/private/dovecot.pem
#ssl_key = </etc/dovecot/private/dovecot.key       ←commented out
I also decided that we don’t need SSL, since hey, 127.0.0.2.

I think I’ll have to do something about the scanner, but otherwise i believe phase I (Stretch → Buster) was pretty easy.

Phase II: Buster → Bullseye

The first part was straightforward but took ... a couple hours? As above, all these steps were done as root:
  • change /etc/apt/sources.list to refer to Bullseye rather than Buster
  • apt-get update; apt-get upgrade; apt-get dist-upgrade
  • reboot

And then…
    “…something about the scanner” ⇐ THIS
So xsane couldn't find the scanner. I got some advice to “sudo scanimage -L” which found only another device (our renter's all-in-one).

A web search on “brother scanners linux” (no quotes) led me to Install the scanner driver (deb) - Linux - BrotherUSA, where I learned what to do, again as root:

collin@p64:~$ brsaneconfig4 -q
                                                             ← nothing appeared here at all!
collin@p64:~$ sudo brsaneconfig4 -a name=mfc9340 model=MFC-9340CDW ip=192.168.1.40
collin@p64:~$ brsaneconfig4 -q
* *MFC-9340CDW [   192.168.1.40]  mfc9340                    ← Now that’s more like it!
collin@p64:~$ sudo scanimage -L
device `brother4:net1;dev0' is a Brother mfc9340 MFC-9340CDW
device `escl:https://192.168.0.235:443' is a HP ENVY 6400 series [BA2627] SSL flatbed scanner
collin@p64:~$ 
And with that, the scanner works. Mail works. Browsers (both firefox and chrome) work. Maybe something else won’t, but that's all for today.

July 29 update: auto-sleep, crashplan

A new-ish feature in Bullseye (it may have come in with Buster?) is that the box will sleep if I walk away for 20 minutes or so. This is fine, except when I'm logged in over VPN and I'm using vmware horizon (yes i have to use that for work). I need to disable it when I'm at work, so I just leave the Settings app up, with Power settings selected. Then it's front and center and it's obvious to me that auto-sleep is either on or off. Usually I re-enable it when I'm done with work for the day.

I got email yesterday or so, saying that my backups haven't happened since the OS upgrade. My first thought was, oh, it's because of being asleep. But then I turned off auto-sleep and tried logging in to the code42 app... no joy. After thrashing for a while, I went to the code42.com support site, where I couldn't login. Oh, I had forgotten that I'd already added google authenticator to firefox! I used it to get my magic rotating number and I was in. I saw the advice to reinstall the app. OK, fine; I downloaded the package, did what came naturally, and said: sudo /usr/local/crashplan/service.sh start

Which didn't work. I looked at /usr/local/crashplan/log/service.log.0 and... something about missing libuaw.so; a web search led me to this post on reddit with the answer: zcat the "CrashPlanSmb_10.0.0.cpi" file (it's in the downloaded tarball), find libuaw.so in the appropriate subdirectory of nlib/ (they didn't have "debian11" but the Reddit-or said ubuntu20, which thankfully worked) and copy it into /usr/local/crashplan/nlib; shazzam! up and running.

August 20 update: convert(1) issue solved; crashplan, not so much

I wanted to convert a ".png" file to PDF, as I have done many times before, but now I get:

convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/421.
How annoying! A web search got me to this stackoverflow post; I followed its advice, but kept the old /etc/ImageMagick-6/policy.xml as /etc/ImageMagick-6/policy.xml-dist; here's the diff:
$ diff -u /etc/ImageMagick-6/policy.xml{-dist,}
--- /etc/ImageMagick-6/policy.xml-dist	2021-04-20 07:37:59.000000000 -0700
+++ /etc/ImageMagick-6/policy.xml	2022-08-20 20:19:06.962756505 -0700
@@ -90,10 +90,12 @@
   <!-- in order to avoid to get image with password text -->
   <policy domain="path" rights="none" pattern="@*"/>
   <!-- disable ghostscript format types -->
-  <policy domain="coder" rights="none" pattern="PS" />
+  <!--  -->
+  <policy domain="coder" rights="read|write" pattern="PS" />
   <policy domain="coder" rights="none" pattern="PS2" />
   <policy domain="coder" rights="none" pattern="PS3" />
   <policy domain="coder" rights="none" pattern="EPS" />
-  <policy domain="coder" rights="none" pattern="PDF" />
+  <!--  -->
+  <policy domain="coder" rights="read|write" pattern="PDF" />
   <policy domain="coder" rights="none" pattern="XPS" />
 </policymap>
And now, convert(1) can do everything I was accustomed to using it for.
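
To double-check which coders the policy file still denies after an edit like this, a one-line sed filter is enough. The function name blocked_coders is made up, and the sketch assumes each policy sits on its own line with the attributes in the same order as in the Debian file above; it does not know about XML comments, so a commented-out policy on a single line would still be listed.

```sh
# Hypothetical helper: list coder patterns that a policy.xml sets to rights="none".
blocked_coders() {
    sed -n 's/.*<policy domain="coder" rights="none" pattern="\([^"]*\)".*/\1/p' "$1"
}
```

Run as "blocked_coders /etc/ImageMagick-6/policy.xml"; with the diff above applied, PS and PDF should no longer appear in the output.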

Crashplan, though… I thought (from a few weeks back) that the service started so everything was good, but not so much! /usr/local/crashplan/log/engine_output.log ends like this:

  Java virtual machine created.
Starting service.
[08.20.22 10:24:56.599 INFO  main             com.code42.utils.ClassFinder] Loaded classpaths in 1960 ms
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc570a458a7, pid=326423, tid=326535
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.12+7 (11.0.12+7) (build 11.0.12+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (11.0.12+7, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libuaw.so+0x1c8a7]  std::filesystem::path::~path()+0x7
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /usr/local/crashplan/hs_err_pid326423.log
[thread 326523 also had an error]
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#
WHOA, did you see that libuaw.so part up there? Do I maybe have a bad libuaw.so? It's late now, though, so maybe in another day or two I'll…

September 24 update: crashplan solved!

OK, so what happened last month was: I did more searching and read another reddit post which said that in 2019 (!) there was no excuse not to run apps in containers. Good point! But I didn't want to do that without learning more about containers. I mean, I can spell "runc exec -it <CONTAINERNAME> bash" (wait, did I get that right?) but beyond that…

I ordered and waited for my own personal copy of Docker: Up & Running 2/e then procrastinated… today I decided, better get to it. Now, where was that reddit post that pointed me at the container to…? Huh, couldn’t find it. Instead I re-found the reddit article linked above, but this time I noticed this comment:

Thanks a lot! Worked on Debian 10 with the ubuntu18 file and on Debian 11 with the ubuntu20 file.

WHOA; I’ve got debian11; which one did I install last month (which crashed)? Based on this output

$ ls -o /tmp/code42-install/nlib/*/libuaw.so
-rw-rw-r-- 1 collin 486008 Aug 21 08:25 /tmp/code42-install/nlib/rhel7/libuaw.so
-rw-rw-r-- 1 collin 243440 Aug 21 08:25 /tmp/code42-install/nlib/rhel8/libuaw.so
-rw-rw-r-- 1 collin 232944 Aug 21 08:25 /tmp/code42-install/nlib/ubuntu18/libuaw.so
-rw-rw-r-- 1 collin  52456 Aug 21 08:25 /tmp/code42-install/nlib/ubuntu20/libuaw.so
$
I evidently installed the rhel7 one. D’oh! I replaced it with the ubuntu20 one, and now crashplan is running, without having to do the container thing.
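
Rather than comparing file sizes by eye, cmp(1) can say which variant is actually installed. A minimal sketch, with a made-up function name (which_variant) and assuming the installed copy lives at /usr/local/crashplan/nlib/libuaw.so:

```sh
# Hypothetical helper: print the nlib subdirectory whose libuaw.so is
# byte-identical to the installed copy; fail if none matches.
which_variant() {
    installed=$1
    nlib=$2
    for candidate in "$nlib"/*/libuaw.so; do
        [ -f "$candidate" ] || continue
        if cmp -s "$installed" "$candidate"; then
            basename "$(dirname "$candidate")"
            return 0
        fi
    done
    return 1
}
```

For the layout above it would be called as "which_variant /usr/local/crashplan/nlib/libuaw.so /tmp/code42-install/nlib".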

PS: the container thing is https://github.com/jlesage/docker-crashplan

Wednesday, July 06, 2022

I actually like being on vacation

I wrote recently that I’m not very good at the “rest” thing. But our two weeks on a faraway island were very enjoyable.

What made it like that? While we were there, I noticed myself calming down. Part of it was the lack of crowds. Was it that, or the lack of car traffic? Was it that people never (well, hardly ever) seemed to be in a hurry? I think it was all those things, but after we got home, I began to understand what else:

  • In the mail was a letter from the DMV. I committed an infraction, a traffic violation, about a year ago. I went to traffic school to get the point erased from my record, but the DMV was reporting my violation to our insurance company, which of course increased my premium. It was not a small increase. So last January, I filled out the form asking why they were being so mean to me. The letter said: “the court hasn’t certified that you completed traffic school.” I spent some quality time on my computer, tracking down the information, then thought to look at the court’s website. After some study, I made a note to myself to call them Tuesday morning.
  • There was email from my insurance company; I thought I had sent them several hundred dollars for the first installment on this year's earthquake insurance, but they never got it. I actually couldn’t find any record of having sent it, other than my scribbled “should arrive May 21.” So another electronic errand...
  • The bank emailed me that they believed that a certain credit card charge was valid (I had disputed the charge); they sent a letter that included a receipt, supposedly signed by us, proving that we made the charge. The slip had our room number on it, but the handwriting on the receipt was not ours, and whatever they wrote on the slip, it was neither our names nor our initials. This was not a small charge. I composed a short note explaining that yes, that was our room number on the slip, but… I also whined that the vendor is irresponsible for not reading the room number from the key or at least verifying the surname. Another electronic errand.
  • The post also contained a notification that Bentley’s animal license fees were overdue. They asked for a reply in case Bentley had died. Another electronic errand.
There were other errands, electronic and otherwise, but you get the idea. I decided that part of the wonder of that experience was being on that island, but another part was being on vacation.

And by “on vacation” I don’t mean “not doing my software job.” What I mean is, not being responsible for dead pets, whining at people who charge me for things I didn’t buy or couldn’t use, missed payments, government agencies, etc.

As our cab driver said, “Back to reality.”

Saturday, November 06, 2021

__stack_chk_fail(): What It Means

Recently I had a “stack smashing” incident to debug at work. It turned out to be a little more complicated than the example I'm about to show you, but at the bottom it was the same. Here's a silly example program.
collin@collin-t450:~/stack-chk$ pr -tn smash.c
    1	#include <stdio.h>
    2	#include <string.h>
    3	
    4	/*
    5	 * Bad programming practice
    6	 */
    7	static void
    8	oops(char const *buf)
    9	{
   10		char local[10];		/* if strlen(buf) > 9, then */
   11		strcpy(local, buf);	/* this line could smash the stack. */
   12		printf("%s\n", local);
   13	}
   14	
   15	/*
   16	 * this provides a level of indirection.
   17	 */
   18	static void
   19	doit(char const *buf)
   20	{
   21		oops(buf);
   22	}
   23	
   24	int
   25	main(int argc, char **argv)
   26	{
   27		char *msg = "hi there";
   28		if (argc > 1 && argv[1] && *argv[1]) {
   29			msg = argv[1];
   30		}
   31		doit(msg);
   32		return 0;
   33	}
collin@collin-t450:~/stack-chk$ 
So, main calls doit, passing either a short string—“hi there”—or a string of indeterminate length provided on the command line.

In turn, doit passes that same string to oops, which blindly copies it into a fixed-length buffer, local (line 11). This is a very bad practice because strcpy can overrun the destination (i.e. it can write past the end of local) if the source string (buf) is too long.

We compile it like this:

collin@collin-t450:~/stack-chk$ make smash
cc -fstack-protector-all -Wall -Werror -g    smash.c   -o smash
collin@collin-t450:~/stack-chk$ 
That -fstack-protector-all says to insert the stack-protector (or stack checking) code into every routine. This is a really good idea, and you should always have it in your makefiles.

Now if we run the program with a short string, all is well, but if the string is longer than about 9 bytes, bad things happen:

collin@collin-t450:~/stack-chk$ ./smash
hi there
collin@collin-t450:~/stack-chk$ ./smash hello
hello
collin@collin-t450:~/stack-chk$ ulimit -c unlimited       ←so we can get a coredump in case of abort
collin@collin-t450:~/stack-chk$ ./smash good-morning
good-morning
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
collin@collin-t450:~/stack-chk$ 
What is “stack smashing”, and how does the code tell that it’s happened? Let’s run gdb on the crash dump and see.
collin@collin-t450:~/stack-chk$ gdb smash core
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
…copyright, GPL, hints, etc. here
Reading symbols from smash...done.
[New LWP 18026]
Core was generated by `./smash good-morning'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fcddd774535 in __GI_abort () at abort.c:79
#2  0x00007fcddd7cb508 in __libc_message (action=, 
    fmt=fmt@entry=0x7fcddd8d607b "*** %s ***: %s terminated\n")
    at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fcddd85c80d in __GI___fortify_fail_abort (
    need_backtrace=need_backtrace@entry=false, 
    msg=msg@entry=0x7fcddd8d6059 "stack smashing detected") at fortify_fail.c:28
#4  0x00007fcddd85c7c2 in __stack_chk_fail () at stack_chk_fail.c:29
#5  0x0000556ce7f921a4 in oops (buf=0x7ffcbf978577 "good-morning") at smash.c:13
#6  0x0000556ce7f921cd in doit (buf=0x7ffcbf978577 "good-morning") at smash.c:21
#7  0x0000556ce7f9224d in main (argc=2, argv=0x7ffcbf977e68) at smash.c:31
(gdb) 
Right. gdb’s “bt” command displays a backtrace; the above shows main calling doit calling oops, which called __stack_chk_fail. The numbers on the left are the frame numbers at the time the crash dump was taken.

I’ll belabor the maybe-obvious for a bit before continuing. Each frame is a record of where the caller expects to resume execution, when/if the callee returns; that is, the caller’s return-address is pushed onto the stack and then the machine begins executing the callee, in the new frame.

Let's see how __stack_chk_fail was called.

(gdb) f 5
#5  0x0000556ce7f921a4 in oops (buf=0x7ffcbf978577 "good-morning") at smash.c:13
13	}
(gdb) disass oops
Dump of assembler code for function oops:
   0x0000564f056b5155 <+0>:	push   %rbp
   0x0000564f056b5156 <+1>:	mov    %rsp,%rbp
   0x0000564f056b5159 <+4>:	sub    $0x30,%rsp
   0x0000564f056b515d <+8>:	mov    %rdi,-0x28(%rbp)
   0x0000564f056b5161 <+12>:	mov    %fs:0x28,%rax       put magic value → %rax
   0x0000564f056b516a <+21>:	mov    %rax,-0x8(%rbp)     stash %rax; → %rbp-8  
   0x0000564f056b516e <+25>:	xor    %eax,%eax
   0x0000564f056b5170 <+27>:	mov    -0x28(%rbp),%rdx
   0x0000564f056b5174 <+31>:	lea    -0x12(%rbp),%rax
   0x0000564f056b5178 <+35>:	mov    %rdx,%rsi
   0x0000564f056b517b <+38>:	mov    %rax,%rdi
   0x0000564f056b517e <+41>:	callq  0x564f056b5030 <strcpy@plt>
   0x0000564f056b5183 <+46>:	lea    -0x12(%rbp),%rax
   0x0000564f056b5187 <+50>:	mov    %rax,%rdi
   0x0000564f056b518a <+53>:	callq  0x564f056b5040 <puts@plt>
   0x0000564f056b518f <+58>:	nop
   0x0000564f056b5190 <+59>:	mov    -0x8(%rbp),%rax                          fetch saved magic value
   0x0000564f056b5194 <+63>:	xor    %fs:0x28,%rax                            xor vs real magic
   0x0000564f056b519d <+72>:	je     0x564f056b51a4 <oops+79>                 jump if saved still matches real magic
   0x0000564f056b519f <+74>:	callq  0x564f056b5050 <__stack_chk_fail@plt>    saved value got corrupted; abort
=> 0x0000564f056b51a4 <+79>:	leaveq 
   0x0000564f056b51a5 <+80>:	retq   
End of assembler dump.
(gdb) 
The “=>” in the left-hand margin shows what we were about to execute in the frame—that is, the return point from calling __stack_chk_fail. But how did we decide to call it?

Let's go back to the beginning of oops. At the <+12> location, we move %fs:0x28 into %rax. What is %fs:0x28? I'm deducing from the usage that it holds a magic value which we store into %rbp-0x8, uh, I mean -0x8(%rbp)—at <+21>.

Then, at <+59>, we read -0x8(%rbp) into %rax, and at <+63> we xor it with %fs:0x28. If the two are equal, the xor leaves zero in %rax and sets the zero flag, so the je (“jump if equal”) at <+72> sends us to the leaveq instruction. But if they are not equal, we fall through to <+74> and call __stack_chk_fail.

To summarize, then, at the beginning of the routine, we store %fs:0x28 into %rbp-0x8; just before returning, we load the (64-bit) word in %rbp-0x8 and compare it to %fs:0x28. If it matches, we’re good, but if not, we call __stack_chk_fail. This stack checking code is inserted into every function—provided that

  • you use the compiler option -fstack-protector-all
  • the function can return (i.e., it doesn’t consist only of a no-break, no-return infinite loop)
  • the function isn’t inlined or otherwise optimized away (e.g., the file is compiled with -O0, or the function isn’t declared static)

So what is at %rbp-0x8 here?

(gdb) x/xg $rbp-0x8
0x7ffcbf977d18:	0x88a84adec300676e
(gdb) 
Alert readers may note that the low-order 3 bytes of the above (i.e., the 00676e) turn out to match the tail end of the string provided on the command line: “ng\0” (0x6e is 'n', 0x67 is 'g', and since x86-64 is little-endian, the lowest-order byte comes first in memory). This is an effect of a bad programming practice: we wrote into a 10-byte buffer, but we wrote 13 bytes (12 characters plus the terminating NUL)!
(gdb) info locals
local = "good-morni"
(gdb) p sizeof local
$1 = 10
(gdb) x/s local
0x7ffcbf977d0e:	"good-morning"
(gdb)
So by writing past the end of the 10-byte buffer “local[]”, we stomped (with “ng\0”) on the magic value used for stack check.
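For contrast, here is a minimal sketch of a copy that cannot run past the end of the buffer. The helper name copy_bounded is hypothetical and not part of smash.c; it leans on standard snprintf, which never writes more than the size it is given and always NUL-terminates when that size is nonzero.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from smash.c): copy src into dst without ever
 * writing more than dstsize bytes. Returns 1 if all of src fit, 0 if it
 * had to be truncated. */
int copy_bounded(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (dstsize == 0)
        return 0;
    /* snprintf returns the length the full string would have needed. */
    int needed = snprintf(dst, dstsize, "%s", src);
    return needed >= 0 && (size_t)needed < dstsize;
}
```

With a 10-byte buffer and the argument “good-morning”, this keeps “good-morn” plus the terminating NUL and reports truncation, so the magic value just above the buffer is left alone.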

Now let’s have a look at the value(s) of %fs:0x28 stored elsewhere, starting one level “up,” that is, with oops’s caller:

(gdb) up
#6  0x0000556ce7f921cd in doit (buf=0x7ffcbf978577 "good-morning") at smash.c:21
21		oops(buf);
(gdb) x/8i doit
   0x556ce7f921a6 <doit>:	push   %rbp
   0x556ce7f921a7 <doit+1>:	mov    %rsp,%rbp
   0x556ce7f921aa <doit+4>:	sub    $0x20,%rsp
   0x556ce7f921ae <doit+8>:	mov    %rdi,-0x18(%rbp)
   0x556ce7f921b2 <doit+12>:	mov    %fs:0x28,%rax       put magic value → %rax
   0x556ce7f921bb <doit+21>:	mov    %rax,-0x8(%rbp)     stash %rax; → %rbp-8
   0x556ce7f921bf <doit+25>:	xor    %eax,%eax
   0x556ce7f921c1 <doit+27>:	mov    -0x18(%rbp),%rax
(gdb) x/xg $rbp-8
0x7ffcbf977d48:	0x88a84adec3f40c00
(gdb) 
Now let's try one more.
(gdb) up
#7  0x0000556ce7f9224d in main (argc=2, argv=0x7ffcbf977e68) at smash.c:31
31		doit(msg);
(gdb) x/8i main
   0x556ce7f921e4 <main>:	push   %rbp
   0x556ce7f921e5 <main+1>:	mov    %rsp,%rbp
   0x556ce7f921e8 <main+4>:	sub    $0x20,%rsp
   0x556ce7f921ec <main+8>:	mov    %edi,-0x14(%rbp)
   0x556ce7f921ef <main+11>:	mov    %rsi,-0x20(%rbp)
   0x556ce7f921f3 <main+15>:	mov    %fs:0x28,%rax       put magic value → %rax
   0x556ce7f921fc <main+24>:	mov    %rax,-0x8(%rbp)     stash %rax; → %rbp-8  
   0x556ce7f92200 <main+28>:	xor    %eax,%eax
(gdb) x/xg $rbp-8
0x7ffcbf977d78:	0x88a84adec3f40c00
(gdb) 
Now compare the above vs. what we had in $rbp-8 in frame 5:
0x7ffcbf977d78: 0x88a84adec3f40c00 ← frame 7
0x7ffcbf977d48: 0x88a84adec3f40c00 ← frame 6
0x7ffcbf977d18: 0x88a84adec300676e ← frame 5
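Identical except for the low-order 3 bytes. The values of %fs:0x28 stored by main and doit match; the value in oops doesn’t. When oops compared its stomped copy against %fs:0x28 just before returning, the two no longer matched, and that’s how the program knew there really was stack smashing.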
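The stomping can be replayed in ordinary C without any undefined behavior. What follows is a sketch under stated assumptions: the frame is modeled as a plain byte array, with the 10-byte local directly below the 8-byte slot for the magic value (as in the dump above), and the byte order is little-endian as on x86-64. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LOCAL_SIZE = 10, GUARD_SIZE = 8 };

/* Read 8 bytes as a little-endian 64-bit value (x86-64 byte order). */
uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = GUARD_SIZE - 1; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Store a 64-bit value as 8 little-endian bytes. */
void store_le64(unsigned char *p, uint64_t v)
{
    for (int i = 0; i < GUARD_SIZE; i++) {
        p[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Model of oops's frame: bytes 0..9 stand in for local[], bytes 10..17 for
 * the stashed magic value. Copies strlen(arg)+1 bytes starting at byte 0,
 * as strcpy would, and returns the magic value as it reads afterwards. */
uint64_t guard_after_copy(uint64_t guard, const char *arg)
{
    unsigned char frame[64] = {0};
    size_t n = strlen(arg) + 1;          /* include the terminating NUL */
    assert(n <= sizeof frame);           /* stay inside the model array */
    store_le64(frame + LOCAL_SIZE, guard);
    memcpy(frame, arg, n);
    return load_le64(frame + LOCAL_SIZE);
}
```

Feeding it 0x88a84adec3f40c00 and “good-morning” gives back 0x88a84adec300676e, the value seen in frame 5. Xoring the two, as the instruction at <+63> does, leaves 0xf46b6e rather than zero, so the je is not taken.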

A few more points

  • The stack checking code doesn’t always catch overruns. It did in this case because the variable named local was immediately below (i.e., lower memory address) the spot where the magic value was stashed away, and we overran local by a few bytes. But if we did something nastier in oops, like
    local[59] = 'x';
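    then oops’s magic value would not have been disturbed. With the addresses seen in this dump, local[59] sits at 0x7ffcbf977d0e + 59 = 0x7ffcbf977d49, inside the slot at 0x7ffcbf977d48 where doit stashed its magic value. So doit’s magic value would probably have been detectably corrupted, and the backtrace would have shown doit, not oops, calling __stack_chk_fail.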
  • If local had been allocated via malloc(3) with that size, rather than being an on-stack variable, buffer overruns might be detected by bug-catching code in malloc or free, rather than code surrounding a call to __stack_chk_fail.
  • As alluded to earlier, if function(s) are declared static and the file is compiled with optimization, the corruption may occur in an “interior” or lower-level routine (a callee of a callee of…) but the stack-checking code may be present in only the caller. This is in fact what happened when I added “-O2” to the compilation command for smash.c:
    collin@collin-t450:~/stack-chk$ cc -fstack-protector-all -Wall -Werror -g -O2   smash.c   -o smash
    collin@collin-t450:~/stack-chk$ ./smash good-morning
    good-morning
    *** stack smashing detected ***: <unknown> terminated
    Aborted (core dumped)
    collin@collin-t450:~/stack-chk$ gdb smash core
    ...
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from smash...done.
    [New LWP 16201]
    Core was generated by `./smash good-morning'.
    Program terminated with signal SIGABRT, Aborted.
    #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
    50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
    (gdb) bt
    #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
    #1  0x00007f3d06aea535 in __GI_abort () at abort.c:79
    #2  0x00007f3d06b41508 in __libc_message (action=, 
        fmt=fmt@entry=0x7f3d06c4c07b "*** %s ***: %s terminated\n")
        at ../sysdeps/posix/libc_fatal.c:181
    #3  0x00007f3d06bd280d in __GI___fortify_fail_abort (
        need_backtrace=need_backtrace@entry=false, 
        msg=msg@entry=0x7f3d06c4c059 "stack smashing detected")
        at fortify_fail.c:28
    #4  0x00007f3d06bd27c2 in __stack_chk_fail () at stack_chk_fail.c:29
    #5  0x000055a71abc70d9 in main (argc=<optimized out>, argv=)
        at smash.c:32
    (gdb)