Running Rails on Fly.io with Tailscale
I’ve been working on a Rails app to manage some of the more toilsome aspects of running @birdsinmuseums, and wanted to try out Fly.io to host it. Since I’m the only user for the foreseeable future, I decided to run the Fly app on my tailnet rather than accessible on the public internet to avoid building auth.
There’s an existing guide from Tailscale that covers the basics of running the Tailscale daemon and client inside a Fly.io container, but I ran into a bunch of weird edges, so this might be useful if you’re trying to do the same thing.
Tailscale installation
When you run flyctl launch for the first time on an existing Rails repo, it generates a bunch of files:
- Dockerfile: builds the container image
- fly.rake: a Rakefile containing custom commands
- fly.toml: custom configuration to run the Rails app
The Tailscale guide advises wget-ing the static binary and unpacking the tarball during the Dockerfile container build. I couldn’t find any signed digests for the binaries, so I ended up checksumming the binary myself and committing a SHA256SUMS file to my repo. Friends don’t let friends curl|sh unsigned binaries!
You can do this with something like:
$ sha256sum tailscale_1.34.2_amd64.tgz
b74cdf5735576e7f0210506235c8ec72d472bdb079a020c5c1095cd3152e69e7 tailscale_1.34.2_amd64.tgz
$ sha256sum tailscale_1.34.2_amd64.tgz > SHA256SUMS
# Dockerfile
COPY SHA256SUMS SHA256SUMS
RUN wget https://pkgs.tailscale.com/stable/${TSFILE} && \
    sha256sum --check SHA256SUMS && \
    tar xzf ${TSFILE} --strip-components=1
The Dockerfile that Fly generates uses fullstaq-ruby, which in turn uses a Debian-based base image. The system packages the Tailscale guide recommends for Alpine don’t map cleanly, so you’ll instead need to install iptables and iproute2 with apt. There’s probably a cleaner way of installing everything with apt, which I may revisit down the line.
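As a rough sketch, the Dockerfile addition might look something like the following (the package names are the Debian equivalents of what the Tailscale guide installs on Alpine; the cleanup step mirrors the one Fly’s generated Dockerfile already performs):

```dockerfile
# Sketch: install the networking tools tailscaled needs on a Debian base
RUN apt-get update -qq && \
    apt-get install --no-install-recommends -y iptables iproute2 && \
    rm -rf /var/lib/apt/lists /var/cache/apt/archives
```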
Dockerfile script munging
The Tailscale guide recommends adding a bin/start.sh script to start the Tailscale daemon and log in. You’ll want to modify Dockerfile and fly.toml to point at this start script.
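The exact lines vary by generator version, but as a hypothetical sketch, the Dockerfile change amounts to making the script executable and handing it control of the container; if your fly.toml specifies a process command, it should point at the same script:

```dockerfile
# Hypothetical sketch; your generated Dockerfile's final lines will differ
RUN chmod +x /app/bin/start.sh
ENTRYPOINT ["/app/bin/start.sh"]
```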
If you try deploying with just this change, you’ll see some weird errors.
The generated Dockerfile munges all scripts in bin/ and injects some Ruby to change the working directory:
# Adjust binstubs to run on Linux and set current working directory
RUN chmod +x /app/bin/* && \
    sed -i 's/ruby.exe\r*/ruby/' /app/bin/* && \
    sed -i 's/ruby\r*/ruby/' /app/bin/* && \
    sed -i '/^#!/aDir.chdir File.expand_path("..", __dir__)' /app/bin/*
It seemed a little weird to assume all scripts in bin/ would be written in Ruby, but it’s easy to have it only apply to Ruby ones:
- sed -i '/^#!/aDir.chdir File.expand_path("..", __dir__)' /app/bin/*
+ sed -i '/^#!.*ruby/aDir.chdir File.expand_path("..", __dir__)' /app/bin/*
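To sanity-check the narrower pattern, here’s a small standalone demo (the file names are made up): only the file with a ruby shebang gets the chdir line injected.

```shell
# Two fake binstubs: one Ruby, one plain shell (hypothetical names)
printf '#!/usr/bin/env ruby\nputs "hi"\n' > ruby_stub
printf '#!/bin/sh\necho hi\n' > start.sh

# Same sed invocation as the Dockerfile, anchored on a ruby shebang
sed -i '/^#!.*ruby/aDir.chdir File.expand_path("..", __dir__)' ruby_stub start.sh

grep -c 'Dir.chdir' ruby_stub          # prints 1: the Ruby binstub was modified
grep -c 'Dir.chdir' start.sh || true   # prints 0: the shell script was left alone
```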
At this point, the app was able to connect to my tailnet!
Cleanly logging out of the tailnet
I’m only planning on running one instance of this app on Fly and wanted it to have a stable URL. While Tailscale’s MagicDNS works well for this, I noticed that after deploys or restarts, I’d end up with multiple Tailscale hosts with number-suffixed IDs (e.g. my-app, my-app-1, my-app-2). This is annoying because I’d have to look up the host name every time I wanted to access the service.
It turns out that for ephemeral nodes, you need to run tailscale logout so they’re cleaned up properly. I ended up modifying the custom start script recommended in the Tailscale guide (bin/start.sh) to not only start up Tailscale before invoking Rails, but also to tear it down.
#!/bin/bash
set -x

pid=0

# Trap SIGINT so we gracefully shut down tailscale so node
# names don't collide upon deploy
sigint_handler() {
  if [ $pid -ne 0 ]; then
    ./tailscale logout
    kill -SIGINT "$pid"
    wait "$pid"
  fi
  exit 130; # 128 + 2 -- SIGINT
}

trap 'kill ${!}; sigint_handler' SIGINT

echo "Starting tailscale daemon"
./tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/var/run/tailscale/tailscaled.sock &

# Sleep for a bit so that we avoid a race condition with tailscaled
sleep 3

echo "Joining tailscale network"
./tailscale up --authkey=${TAILSCALE_AUTHKEY} --hostname=my-app

# Manually invoke the fly:swapfile rake target and then `bin/rails server`
# directly, since the generated rake target doesn't forward signals correctly.
echo "Starting rails server"
bin/rails fly:swapfile
bin/rails server &
pid="$!"

while true
do
  tail -f /dev/null & wait ${!}
done
Signal propagation
Fly generates a fly:server Rake target that handles both twiddling some swap settings and invoking the server. As far as I can tell, the way the script shells out to bin/rails server doesn’t propagate Unix signals correctly, so the process hangs after SIGINT until it hits the specified Fly kill_timeout. To work around this, I invoked the fly:swapfile Rake target and the server directly.
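For reference, the timeout in question is a top-level setting in fly.toml (the value here is illustrative):

```toml
# fly.toml: seconds Fly waits after sending the kill signal
# before hard-killing the VM (value is illustrative)
kill_timeout = 5
```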
tailscale client deadlock
In some cases upon restart, the script would hang running tailscale up. I was able to resolve it by flyctl ssh console-ing into the app and running tailscale up manually. After adding a sleep this seemed to go away, so it seems like there’s a race condition during tailscaled’s startup process that causes the client to hang indefinitely.
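A sturdier alternative to a fixed sleep is a bounded poll. This is a sketch under the assumption that tailscaled creates its --socket path once it’s ready to serve clients (socket existence isn’t a complete readiness guarantee, but it narrows the race); the helper name and arguments are made up:

```shell
# Hypothetical helper: wait up to tries * 0.1s for a path to appear
wait_for() {
  path="$1"
  tries="${2:-50}"
  while [ "$tries" -gt 0 ]; do
    if [ -e "$path" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 0.1
  done
  return 1
}

# In bin/start.sh, replacing `sleep 3`:
# wait_for /var/run/tailscale/tailscaled.sock || exit 1
```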
Client healthcheck failure
While things were mostly working, I noticed that tailscale status returned a weird healthcheck error:
# Health check:
# - router: setting up filter/ts-input: running [/usr/sbin/iptables -t filter -N ts-input --wait]: exit status 4: iptables v1.8.7 (nf_tables): Could not fetch rule set generation id: Invalid argument
Some internet sleuthing suggested that v1.8.7 is backed by nf_tables and may be incompatible with some software. Switching back to the legacy iptables binaries seems to address the issue:
# Dockerfile
- && rm -rf /var/lib/apt/lists /var/cache/apt/archives
+ && rm -rf /var/lib/apt/lists /var/cache/apt/archives \
+ && update-alternatives --set iptables /usr/sbin/iptables-legacy \
+ && update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
DNS TTL
The last hiccup I ran into was that sometimes after a deploy, the page wouldn’t load in the browser. Tailscale currently sets the default DNS TTL to 600 seconds (10 minutes), and flushing my local DNS cache sidesteps the issue. There’s a GitHub issue that proposes lowering the TTL for ephemeral nodes.
Closing
It took me longer than expected to get the signal propagation pieces working (mostly because testing changes requires a deploy and a restart to invoke new teardown logic), but overall I’ve been impressed with how slick both products are!