{"id":671,"date":"2020-07-29T15:05:10","date_gmt":"2020-07-29T22:05:10","guid":{"rendered":"https:\/\/porkrind.org\/missives\/?p=671"},"modified":"2023-03-01T13:47:34","modified_gmt":"2023-03-01T21:47:34","slug":"failed-emacs-builds-hanging-kernels-abort-oh-my","status":"publish","type":"post","link":"https:\/\/porkrind.org\/missives\/failed-emacs-builds-hanging-kernels-abort-oh-my\/","title":{"rendered":"Failed Emacs builds, hanging kernels, abort(), oh my"},"content":{"rendered":"<p>My <a href=\"https:\/\/emacsformacos.com\/builds#Nightlies\">nightly Emacs builds<\/a> stopped about a month and a half ago. A couple days after I noticed it was failing I tried to debug the issue and found that building openssl was hanging\u2014I found that Jenkins was timing out after an hour or so. I should mention that it&#8217;s dying on a Mac OS X 10.10 (Yosemite) VM, which is currently the oldest macOS version I build Emacs for. I tried building manually in a terminal\u2014the next day it was still sitting there, not finished. I decided it was going to be annoying and so I avoided looking deeper into it for another month and a half (sorry!). Today I tracked it down and &#8220;fixed&#8221; it\u2014here is my tale\u2026<\/p>\n<p>I (ab)use homebrew to build openssl. Brew configures openssl then runs <code>make<\/code> and then <code>make test<\/code>. <code>make test<\/code> was hanging. Looking at the process list, I could see <code>01-test_abort.t<\/code> was the hanging test. It was also literally the first test. Weird. I checked out <a href=\"https:\/\/github.com\/openssl\/openssl\/blob\/OpenSSL_1_1_1g\/test\/aborttest.c\">the code<\/a>:<\/p>\n<pre><code class=\"language-c\">#include &lt;openssl\/crypto.h&gt;\n\nint main(int argc, char **argv)\n{\n    OPENSSL_die(\"Voluntary abort\", __FILE__, __LINE__);\n    return 0;\n}\n<\/code><\/pre>\n<p>Well, that seems straightforward enough. Why would it hang? I tried to kill off the test process to see if it would continue. 
There was a lib wrapper, a test harness and the actual binary from the source shown above\u2014they all died nicely except for the actual <code>aborttest<\/code> executable. I couldn&#8217;t even <code>kill -9<\/code> that one\u2014that usually means there&#8217;s some sort of kernel issue going on\u2014<em>everything<\/em> should be <code>kill -9<\/code>able.<\/p>\n<p>Next I ran it by hand (<code>.\/util\/shlib_wrap.sh test\/aborttest<\/code>) and confirmed that the test just hung and couldn&#8217;t be killed. I built it on a different machine and it worked just fine there. So I dug into the <a href=\"https:\/\/github.com\/openssl\/openssl\/tree\/OpenSSL_1_1_1g\">openssl code<\/a>. What does <code>OPENSSL_die()<\/code> do, anyway?<\/p>\n<p><a href=\"https:\/\/github.com\/openssl\/openssl\/blob\/OpenSSL_1_1_1g\/crypto\/cryptlib.c#L416\">Not much:<\/a><\/p>\n<pre><code class=\"language-c\">\/\/ Win32 #ifdefs removed for readability:\nvoid OPENSSL_die(const char *message, const char *file, int line)\n{\n    OPENSSL_showfatal(\"%s:%d: OpenSSL internal error: %s\\n\",\n                      file, line, message);\n    abort();\n}\n<\/code><\/pre>\n<p>Ok, that&#8217;s nothing. What about <code>OPENSSL_showfatal()<\/code>? <a href=\"https:\/\/github.com\/openssl\/openssl\/blob\/OpenSSL_1_1_1g\/crypto\/cryptlib.c#L399\">Also not much:<\/a><\/p>\n<pre><code class=\"language-c\">void OPENSSL_showfatal(const char *fmta, ...)\n{\n#ifndef OPENSSL_NO_STDIO\n    va_list ap;\n\n    va_start(ap, fmta);\n    vfprintf(stderr, fmta, ap);\n    va_end(ap);\n#endif\n}\n<\/code><\/pre>\n<p>That&#8217;s just a print, nothing exciting. Hmmm. So I wrote a test program:<\/p>\n<pre><code class=\"language-c\">#include &lt;stdlib.h&gt;\n\nint main() {\n  abort();\n}\n<\/code><\/pre>\n<p>I compiled it up and\u2026 it hung, too! What?? Ok. I tried it as root (hung). 
Tried it with <code>dtruss<\/code>:<\/p>\n<pre><code class=\"language-plain\">...lots of dtruss nonsense snipped...\n37772\/0xcaf1:  sigprocmask(0x3, 0x7FFF5DD71C74, 0x0)         = 0x0 0\n37772\/0xcaf1:  __pthread_sigmask(0x3, 0x7FFF5DD71C80, 0x0)       = 0 0\n37772\/0xcaf1:  __pthread_kill(0x603, 0x6, 0x0)       = 0 0\n<\/code><\/pre>\n<p>So it got to the kernel with <code>pthread_kill()<\/code> and hung after that. So I tried another sanity check: In one terminal I ran <code>sleep 100<\/code>. In another I found the process id and did <code>kill -ABRT $pid<\/code>. The kill returned, but the sleep was now hung and not able to be killed by <code>kill -9<\/code>, like everything else. Now I was very confused. This can&#8217;t be a real bug, everyone would be seeing this! Maybe it&#8217;s a VM emulation issue caused by my version of VMware? I can&#8217;t upgrade my VMware because the next version after mine requires Mac OS 10.14 but this Mac Mini of mine only supports 10.13. Sigh. Also, the Emacs builds were working just fine and then they suddenly stopped and I hadn&#8217;t updated the OS <em>or<\/em> the host OS <em>or<\/em> VMware. Nothing was adding up!<\/p>\n<p>As a sanity check, I decided to reinstall the OS on the VM (right over top of the existing one, nothing clean or anything). There was a two hour long sidetrack here with deleting VM snapshots, resizing the VM disk (which required booting into recovery mode), downloading the OS installer and finally letting the install run. But that&#8217;s not important. The important part is that I opened up terminal immediately after the OS installed and ran my <code>abort()<\/code> test:<\/p>\n<pre><code class=\"language-shell-session\">$ .\/test\nAbort trap: 6\n$ \n<\/code><\/pre>\n<p>It worked! How about OpenSSL?<\/p>\n<pre><code class=\"language-shell-session\">$ .\/util\/shlib_wrap.sh test\/aborttest\ntest\/aborttest.c:14: OpenSSL internal error: Voluntary abort\nAbort trap: 6\n$ \n<\/code><\/pre>\n<p>Yay! 
But why?? I don&#8217;t actually know. Was it a corrupt kernel? A bad driver that got installed? (What drivers would get installed on this Jenkins builder?) I don&#8217;t feel very satisfied here. I&#8217;m quite skeptical, in fact! But it&#8217;s working. Emacs builds should start coming out again. And I can ignore everything again until the next fire starts! \ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My nightly Emacs builds stopped about a month and a half ago. A couple days after I noticed it was failing I tried to debug the issue and found that building openssl was hanging\u2014I found that Jenkins was timing out after an hour or so. I should mention that it&#8217;s dying on a Mac OS &hellip; <a href=\"https:\/\/porkrind.org\/missives\/failed-emacs-builds-hanging-kernels-abort-oh-my\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Failed Emacs builds, hanging kernels, abort(), oh my<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-671","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/posts\/671","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/comments?post=671"}],"version-history":[{"count":10,"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/posts\/671\/revi
sions"}],"predecessor-version":[{"id":738,"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/posts\/671\/revisions\/738"}],"wp:attachment":[{"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/media?parent=671"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/categories?post=671"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/porkrind.org\/missives\/wp-json\/wp\/v2\/tags?post=671"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}