Running Rails on Fly.io with Tailscale
I’ve been working on a Rails app to manage some of the more toilsome aspects of running @birdsinmuseums, and wanted to try out Fly.io to host it. Since I’m the only user for the foreseeable future, I decided to run the Fly app on my tailnet rather than accessible on the public internet to avoid building auth.
There’s an existing guide from Tailscale that covers the basics of running the Tailscale daemon and client inside a Fly.io container, but I ran into a bunch of weird edges, so this might be useful if you’re trying to do the same thing.
Tailscale installation
When you run flyctl launch for the first time on an existing Rails repo, it generates a bunch of files:
- Dockerfile: builds the container image
- fly.rake: a Rakefile containing custom commands
- fly.toml: custom configuration to run the Rails app
The Tailscale guide advises wget-ing the static binary and unpacking the tarball during the Dockerfile container build. I couldn’t find any signed digests for the binaries, so I ended up checksumming the binary myself and committing a SHA256SUMS file to my repo. Friends don’t let friends curl|sh unsigned binaries!
You can do this with something like:
$ sha256sum tailscale_1.34.2_amd64.tgz
b74cdf5735576e7f0210506235c8ec72d472bdb079a020c5c1095cd3152e69e7 tailscale_1.34.2_amd64.tgz
$ sha256sum tailscale_1.34.2_amd64.tgz > SHA256SUMS
# Dockerfile
COPY SHA256SUMS SHA256SUMS
RUN wget https://pkgs.tailscale.com/stable/${TSFILE} && \
    sha256sum --check SHA256SUMS && \
    tar xzf ${TSFILE} --strip-components=1
The Dockerfile that Fly generates uses fullstaq-ruby, which in turn uses a Debian-based base image. The system packages the Tailscale guide recommends for Alpine don’t map cleanly, so you’ll instead need to install iptables and iproute2 with apt. There’s probably a cleaner way of installing everything with apt, which I may revisit down the line.
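As a rough sketch, the Dockerfile addition might look something like the following (the package names are the Debian equivalents of what the Tailscale guide installs on Alpine; the cleanup step mirrors the one Fly’s generated Dockerfile already performs):

```dockerfile
# Sketch: install the networking tools tailscaled needs on a Debian base
RUN apt-get update -qq && \
    apt-get install --no-install-recommends -y iptables iproute2 && \
    rm -rf /var/lib/apt/lists /var/cache/apt/archives
```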
Dockerfile script munging
The Tailscale guide recommends adding a bin/start.sh script to start the Tailscale daemon and log in. You’ll want to modify Dockerfile and fly.toml to point at this start script.
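The exact lines vary by generator version, but as a hypothetical sketch, the Dockerfile change amounts to making the script executable and handing it control of the container; if your fly.toml specifies a process command, it should point at the same script:

```dockerfile
# Hypothetical sketch; your generated Dockerfile's final lines will differ
RUN chmod +x /app/bin/start.sh
ENTRYPOINT ["/app/bin/start.sh"]
```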
If you try deploying with just this change, you’ll see some weird errors.
The generated Dockerfile munges all scripts in bin/ and injects some Ruby to change the working directory:
# Adjust binstubs to run on Linux and set current working directory
RUN chmod +x /app/bin/* && \
    sed -i 's/ruby.exe\r*/ruby/' /app/bin/* && \
    sed -i 's/ruby\r*/ruby/' /app/bin/* && \
    sed -i '/^#!/aDir.chdir File.expand_path("..", __dir__)' /app/bin/*
It seemed a little weird to assume all scripts in bin/ would be written in Ruby, but it’s easy to have it only apply to Ruby ones:
- sed -i '/^#!/aDir.chdir File.expand_path("..", __dir__)' /app/bin/*
+ sed -i '/^#!.*ruby/aDir.chdir File.expand_path("..", __dir__)' /app/bin/*
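To sanity-check the narrower pattern, here’s a small standalone demo (the file names are made up): only the file with a ruby shebang gets the chdir line injected.

```shell
# Two fake binstubs: one Ruby, one plain shell (hypothetical names)
printf '#!/usr/bin/env ruby\nputs "hi"\n' > ruby_stub
printf '#!/bin/sh\necho hi\n' > start.sh

# Same sed invocation as the Dockerfile, anchored on a ruby shebang
sed -i '/^#!.*ruby/aDir.chdir File.expand_path("..", __dir__)' ruby_stub start.sh

grep -c 'Dir.chdir' ruby_stub          # prints 1: the Ruby binstub was modified
grep -c 'Dir.chdir' start.sh || true   # prints 0: the shell script was left alone
```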
At this point, the app was able to connect to my tailnet!
Cleanly logging out of the tailnet
I’m only planning on running one instance of this app on Fly and wanted it to have a stable URL. While Tailscale’s MagicDNS works well for this, I noticed that after deploys or restarts, I’d end up with multiple Tailscale hosts with number-suffixed IDs (e.g. my-app, my-app-1, my-app-2). This is annoying because I’d have to look up the host name every time I wanted to access the service.
It turns out that for ephemeral nodes, you need to run tailscale logout so they’re cleaned up properly. I ended up modifying the custom start script recommended in the Tailscale guide (bin/start.sh) to not only start up Tailscale before invoking Rails, but also to tear it down.
#!/bin/bash
set -x

pid=0

# Trap SIGINT so we gracefully shut down tailscale so node
# names don't collide upon deploy
sigint_handler() {
  if [ $pid -ne 0 ]; then
    ./tailscale logout
    kill -SIGINT "$pid"
    wait "$pid"
  fi
  exit 130; # 128 + 2 -- SIGINT
}

trap 'kill ${!}; sigint_handler' SIGINT

echo "Starting tailscale daemon"
./tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/var/run/tailscale/tailscaled.sock &

# Sleep for a bit so that we avoid a race condition with tailscaled
sleep 3

echo "Joining tailscale network"
./tailscale up --authkey=${TAILSCALE_AUTHKEY} --hostname=my-app

# Manually invoke the fly:swapfile rake target and then `bin/rails server`
# directly, since the generated rake target doesn't forward signals correctly.
echo "Starting rails server"
bin/rails fly:swapfile
bin/rails server &
pid="$!"

while true
do
  tail -f /dev/null & wait ${!}
done
Signal propagation
Fly generates a fly:server Rake target that handles both twiddling some swap settings and invoking the server. As far as I can tell, the way the script shells out to bin/rails server doesn’t propagate Unix signals correctly, so the process hangs after SIGINT until it hits the specified Fly kill_timeout. To work around this, I invoked the fly:swapfile Rake target and the server directly.
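For reference, the timeout in question is a top-level setting in fly.toml (the value here is illustrative):

```toml
# fly.toml: seconds Fly waits after sending the kill signal
# before hard-killing the VM (value is illustrative)
kill_timeout = 5
```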
tailscale client deadlock
In some cases upon restart, the script would hang running tailscale up. I was able to resolve it by flyctl ssh console-ing into the app and running tailscale up manually. After adding a sleep this seemed to go away, so it seems like there’s a race condition during tailscaled’s startup process that causes the client to hang indefinitely.
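A sturdier alternative to a fixed sleep is a bounded poll. This is a sketch under the assumption that tailscaled creates its --socket path once it’s ready to serve clients (socket existence isn’t a complete readiness guarantee, but it narrows the race); the helper name and arguments are made up:

```shell
# Hypothetical helper: wait up to tries * 0.1s for a path to appear
wait_for() {
  path="$1"
  tries="${2:-50}"
  while [ "$tries" -gt 0 ]; do
    if [ -e "$path" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 0.1
  done
  return 1
}

# In bin/start.sh, replacing `sleep 3`:
# wait_for /var/run/tailscale/tailscaled.sock || exit 1
```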
Client healthcheck failure
While things were mostly working, I noticed that tailscale status returned a weird healthcheck error:
# Health check:
# - router: setting up filter/ts-input: running [/usr/sbin/iptables -t filter -N ts-input --wait]: exit status 4: iptables v1.8.7 (nf_tables): Could not fetch rule set generation id: Invalid argument
Some internet sleuthing suggested that v1.8.7 is backed by nf_tables and may be incompatible with some software. Switching back to the legacy iptables binaries seems to address the issue:
# Dockerfile
- && rm -rf /var/lib/apt/lists /var/cache/apt/archives
+ && rm -rf /var/lib/apt/lists /var/cache/apt/archives \
+ && update-alternatives --set iptables /usr/sbin/iptables-legacy \
+ && update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
DNS TTL
The last hiccup I ran into was that sometimes after a deploy, the page wouldn’t load in the browser. Tailscale currently sets the default DNS TTL to 600 seconds (10 minutes), and flushing my local DNS cache sidesteps the issue. There’s a GitHub issue that proposes lowering the TTL for ephemeral nodes.
Closing
It took me longer than expected to get the signal propagation pieces working (mostly because testing changes requires a deploy and a restart to invoke new teardown logic), but overall I’ve been impressed with how slick both products are!