I trained a foundation model on a $2/hour cloud GPU and watched it on my laptop. No AWS required.
Or: how Tailscale + Lambda Labs + a tracking server on your laptop replace hours of AWS infrastructure work and most of its cost.
I needed a GPU. Specifically, an NVIDIA A100 to scale a tabular foundation model from a 1% sample (a 60-minute MacBook MPS run) up to a 20% sample (a five-to-eight hour run that needs real hardware). Single GPU. One job. Maybe twenty hours of total compute.
The default playbook for this used to be:
- Open the AWS console.
- Pick a `p4d.24xlarge` (eight A100s, $32/hour minimum) because that's all they sell with an A100 in it on-demand, or fight for capacity in spot.
- Set up an IAM user with permissions you only half-understand.
- Configure a VPC, subnets, security groups, an internet gateway, an SSH key pair, an EBS volume of a size you'll regret in either direction.
- Pick an AMI. Was it Deep Learning AMI version 73 or 74? Does that have the right CUDA?
- SSH in. Realize the disk you provisioned is full because the AMI's base image already took 80 of your 100 GB.
- Realize you forgot to set up CloudWatch and now you have no way to see GPU utilization without installing the NVIDIA exporter and pointing it at a Prometheus you haven't deployed.
- Finally start training, four hours later.
- Forget to terminate. Wake up to a $400 bill.
I did not do this. I did this instead:
- Sign up for Lambda Labs. (Five minutes.)
- Add an SSH key to the dashboard.
- Set up a Lambda API token in `.env`.
- Write a 200-line Python launcher that hits Lambda's REST API (used Claude Code).
- Install Tailscale on the laptop.
That's the entire one-time setup. Every cloud training run from there looks like this:
uv run python scripts/lambda.py launch --filesystem nanochat
uv run python scripts/lambda.py wait <id>
# SSH into the instance, start training inside tmux
# Watch MLflow on http://localhost:5000 — but it's logging from a GPU 1,000 km away
Total cost for a five-to-eight hour single-A100 run: $10 to $16. Compared to the on-demand AWS equivalent — p4d.24xlarge at $32.77/hour — even a five-hour run is $163.85. A sixteenfold pricing difference, before you've even paid for EBS, transfer egress, or the IAM auditor your security team makes you talk to.
Let me show you how this works end to end.
The stack
- Lambda Labs Cloud (lambda.ai) — sells single-GPU instances on-demand by the minute. A100 40GB SXM4 is $1.99/hour, A10 24GB is $1.29/hour, H100 SXM5 is $4.29/hour. Their REST API is simple, well-documented, and not buried under five layers of AWS abstractions.
- Tailscale (tailscale.com) — a zero-config WireGuard mesh. Free for personal use. Once installed on two machines, they can reach each other by hostname over an encrypted tunnel, regardless of NAT or firewalls.
- MLflow (mlflow.org) — open-source experiment tracking. Runs on your laptop with a SQLite backend. The cloud training run logs to it in real time over Tailscale.
- uv (astral.sh/uv) — fast Python package manager, makes "set up an environment on a fresh instance" a 30-second operation instead of a 10-minute Anaconda dance (see the snippet after this list).
- A 200-line Python launcher that wraps Lambda's API and replaces every reason you'd normally need Terraform.
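For concreteness, the fresh-instance setup the uv bullet refers to is only a few commands. The installer one-liner is uv's documented install path; the project directory name is a placeholder for your own repo:

curl -LsSf https://astral.sh/uv/install.sh | sh   # official uv installer
cd ~/synthia                                      # placeholder: your project checkout
uv sync                                           # create the venv, install pinned deps from uv.lock
uv run python -c "import torch; print(torch.cuda.is_available())"   # sanity-check CUDA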
Get a GPU in five minutes
Lambda's launch API takes a JSON body with the instance type, region, SSH key name, and (optionally) a persistent filesystem and firewall ruleset. The whole "launch a new GPU instance" implementation is about 30 lines:
def launch(instance_type, region, ssh_key_name, filesystem=None, firewall=None):
    # `api` is the launcher's thin wrapper around Lambda's REST API (sketched below);
    # `resolve_firewall_id` maps a human-friendly ruleset name to its ID.
    body = {
        "region_name": region,
        "instance_type_name": instance_type,
        "ssh_key_names": [ssh_key_name],
        "quantity": 1,
        "name": "synthia-train",
    }
    if filesystem:
        body["file_system_names"] = [filesystem]
    if firewall:
        # API wants [{"id": "..."}], not names
        body["firewall_rulesets"] = [{"id": resolve_firewall_id(firewall)}]
    return api("POST", "/instance-operations/launch", body)
The API returns an instance ID. Poll the GET /instances endpoint until status is active, and you have a publicly reachable Linux box with the GPU drivers preinstalled, CUDA in your path, and Python ready to go. Total wall time from launch to first SSH: about three minutes.
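For completeness, here is a minimal sketch of the two helpers the launcher leans on. The endpoint paths and Bearer auth follow Lambda's public API docs; the function names, the LAMBDA_API_KEY variable, and the polling interval are my conventions, not Lambda's:

import os
import time

import requests

BASE = "https://cloud.lambdalabs.com/api/v1"

def api(method, path, body=None):
    """Thin wrapper over Lambda's REST API; token read from the environment."""
    resp = requests.request(
        method,
        BASE + path,
        json=body,
        headers={"Authorization": f"Bearer {os.environ['LAMBDA_API_KEY']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def wait_active(instance_id, timeout=600, interval=15):
    """Poll the instance until its status is 'active', then return its public IP."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        inst = api("GET", f"/instances/{instance_id}")["data"]
        if inst["status"] == "active":
            return inst["ip"]
        time.sleep(interval)
    raise TimeoutError(f"instance {instance_id} not active after {timeout}s")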
There is no VPC. No security group beyond what Lambda calls a "firewall ruleset" — a single object with a list of {protocol, port_range, source_network} tuples. No subnet planning. No NAT gateway hidden in a different invoice line item. Just an IP address and an SSH port.
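For illustration, a ruleset that admits SSH plus Tailscale's direct-connection port is just such a list. The three field names are the ones described above; the specific rules are my example, not a recommendation:

# Example ruleset body: plain {protocol, port_range, source_network} entries.
allow_ssh_and_tailscale = [
    {"protocol": "tcp", "port_range": [22, 22], "source_network": "0.0.0.0/0"},
    {"protocol": "udp", "port_range": [41641, 41641], "source_network": "0.0.0.0/0"},
]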
Persistent storage, the right way
The first thing I noticed: my training data is 13 GB of DuckDB. Uploading that from my house to Lambda over residential broadband takes 20-30 minutes. Doing that every cloud session would be obscene.
Lambda's answer is Filesystems — persistent network storage you create once, attach to instances at launch time, and which survives termination. They cost roughly $0.20/GB/month. My 13 GB sits in a 64 GB filesystem for about $13/month, roughly the price of six and a half hours of A100 time.
The workflow:
- Create the filesystem once via the dashboard or API.
- Pass `--filesystem nanochat` to the launcher. The instance boots with `/lambda/nfs/nanochat/` mounted.
- The first time, upload the data to `/lambda/nfs/nanochat/synthia/`.
- Every subsequent launch, the data is already there. The deploy script checks file size on the remote side and skips the rsync entirely if it matches (sketched below).
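The skip logic is a few lines. A sketch under two assumptions: the deploy script is Python, and the remote side is Linux (so `stat -c %s` exists); the hostname and paths are examples:

import os
import subprocess

def remote_size(host: str, path: str) -> int:
    """Size of a remote file in bytes, or -1 if it does not exist."""
    out = subprocess.run(
        ["ssh", host, f"stat -c %s {path} 2>/dev/null || echo -1"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def sync_if_changed(host: str, local: str, remote: str) -> None:
    """rsync the file only when the remote copy's size differs from the local one."""
    if os.path.getsize(local) == remote_size(host, remote):
        print(f"{remote} matches local size, skipping rsync")
        return
    subprocess.run(["rsync", "-avzP", local, f"{host}:{remote}"], check=True)

sync_if_changed("ubuntu@100.64.12.34",
                "data/synthia.duckdb",
                "/lambda/nfs/nanochat/synthia/synthia.duckdb")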
Launching with a filesystem versus the AWS equivalent compares roughly like this:
| Step | Lambda | AWS |
|---|---|---|
| Create persistent storage | One API call, one minute | Create an EBS volume in the right AZ, then later create a snapshot for portability across AZs |
| Attach to instance | `--filesystem nanochat` at launch | Configure the instance's block device mapping in user-data, or `aws ec2 attach-volume` after boot |
| Mount inside instance | Auto-mounted at `/lambda/nfs/<name>` | Manually `mkfs`, `mount`, edit `/etc/fstab`, hope you got the device path right |
| Cost for 64 GB / month | ~$13 | ~$6.40 for gp3 |
AWS is cheaper on storage alone, but they assume you have the operational budget to deal with the rest of the iceberg.
The transport problem (and why Tailscale won)
Here is the part I learned the hard way.
My initial deploy script opened a fresh SSH connection for each step: verify connectivity, check the filesystem mount, create directories, rsync the repo, rsync the artifacts, rsync the database, create a symlink. Seven SSH connections in 30 seconds.
The first three worked. The fourth timed out. Ping still got through; TCP to port 22 was being silently dropped.
I spent an hour diagnosing this. It wasn't Lambda's firewall (I attached an "allow all TCP from 0.0.0.0/0" ruleset and the symptom persisted). It wasn't my ISP (Google was fine, ICMP was fine). It was almost certainly fail2ban or a similar SSH brute-force protection somewhere in the path — plenty of cloud images ship one by default, and seven rapid SSH connections from the same source IP look identical to a credential-stuffing attack.
The standard "fix" is SSH connection multiplexing — open one TCP connection to port 22 and have all subsequent ssh and rsync calls reuse it via a control socket. That works. But it's a workaround for a problem you shouldn't have in the first place.
The better answer: don't use public-internet port 22 at all. Tailscale gives you a free WireGuard mesh between any two machines you control. Once both the laptop and the Lambda instance are on the same tailnet, you can SSH and rsync over the Tailscale IP, which routes over UDP/41641 to the cloud instance and never touches port 22 from the public internet. No brute-force detector ever fires. No NAT conntrack ever gets confused.
Getting Tailscale onto a fresh Lambda instance is two commands, run via Lambda's web Jupyter terminal (which is served through their proxy, no SSH needed):
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
The second command prints a https://login.tailscale.com/... URL. Click it in any browser logged into your Tailscale account, authorize the device, done. The Lambda instance is now a peer in your tailnet, addressable by IP (100.x.y.z) or hostname.
From then on:
ssh ubuntu@100.64.12.34 # SSH over Tailscale
rsync -avzP big-file.db ubuntu@100.64.12.34:~/data/ # rsync over Tailscale
Throughput on my home gigabit fiber to a Lambda instance in us-east-1 via Tailscale: 60-80 MB/s on small files, 7-15 MB/s sustained on the 13 GB database (apparently some packet loss along the path). Connection reliability: completely stable. Zero mid-transfer failures across multiple deploys.
MLflow on your laptop, logged from the cloud
This is the part that genuinely felt like the future.
MLflow's default deployment story is: run a tracking server on some shared host, give it a persistent backend (S3 + RDS in the AWS world), and have everyone point their MLFLOW_TRACKING_URI environment variable at it. That's three more pieces of infrastructure to own.
The new way:
- Run `mlflow ui --host 0.0.0.0` on your laptop. Backend is a local SQLite file.
- Set `MLFLOW_SERVER_ALLOWED_HOSTS=juggernaut:*,100.64.10.20:*,...` so MLflow's FastAPI security middleware accepts Host headers from your Tailscale hostname (a small gotcha — it 403s anything that isn't `localhost` by default).
- On the Lambda instance, set `MLFLOW_TRACKING_URI=http://juggernaut:5000`. (`juggernaut` is my laptop's Tailscale hostname.)
- Start training (client-side sketch below). Watch the metrics stream into `http://localhost:5000` in your laptop browser, live, from a GPU 1,000 km away.
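The client half of that list needs nothing exotic. A sketch using standard mlflow calls; the experiment name, parameters, and stub training loop are placeholders:

import os

import mlflow

def training_loop():
    """Stand-in for the real training loop; yields one loss value per step."""
    for step in range(100):
        yield 1.0 / (step + 1)

# Points at the laptop over the tailnet (set in the shell or .env on the instance).
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "http://juggernaut:5000"))
mlflow.set_experiment("synthia-20pct")  # placeholder name

with mlflow.start_run():
    mlflow.log_params({"sample_frac": 0.2, "precision": "bf16"})
    for step, loss in enumerate(training_loop()):
        mlflow.log_metric("train_loss", loss, step=step)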
The cloud instance only sees an MLflow URL. It doesn't know or care that "juggernaut" is actually a laptop sitting on my desk behind a residential NAT. Tailscale routes the HTTP calls over WireGuard. From the cloud instance's perspective, it's just an MLflow server reachable on the local network.
When training finishes, the experiment is already in my laptop's SQLite. No sync, no S3 download, no "let me VPN into the cluster to look at last night's run." It was always there, in real time.
Cost vs AWS: actual numbers
The 20% scale-up run takes about 6-8 hours on an A100 with bf16 and torch.compile enabled. Let me compare:
| Provider | Instance | Hourly | 8-hour run |
|---|---|---|---|
| Lambda Labs | gpu_1x_a100_sxm4 | $1.99 | $15.92 |
| AWS on-demand | p4d.24xlarge (8x A100) | $32.77 | $262.16 |
| AWS Spot (best case) | p4d.24xlarge spot | ~$10-15 | $80-120 (with eviction risk) |
| AWS on-demand single A100 | (does not exist as an SKU) | n/a | n/a |
| GCP | a2-highgpu-1g (1x A100) | $3.67 | $29.36 |
| Azure | NC24ads A100 v4 | $3.67 | $29.36 |
AWS does not sell single-A100 instances on-demand at all. You're paying for eight whether you use them or not. Spot is competitive on per-hour pricing but eviction during an 8-hour run is annoying enough that I wouldn't bother for production work. Lambda is half the price of the next-cheapest single-A100 option (GCP), and 16x cheaper than the only AWS path that actually puts an A100 on demand.
Add to this:
- Lambda persistent filesystem: $13/month for 64 GB. Used across every run.
- Tailscale: free for the personal tier (up to 100 devices).
- MLflow: free, runs on the laptop you already own.
Total fixed monthly: ~$13. Per-run: ~$10-16 depending on length.
The AWS-equivalent run uses S3 for data ($0.023/GB/month + transfer), EBS for instance storage, possibly RDS for MLflow, an internet gateway, NAT gateway charges if your training instance is in a private subnet (oh god), and the per-hour line items above. A friend of mine ran similar work on AWS and the EC2 portion was about 30% of the total monthly bill.
When you'd still pick AWS
Honestly:
- Production deployments with strict compliance. SOC 2, HIPAA, FedRAMP — AWS has the certifications and the audit story. Lambda doesn't.
- Multi-region failover or 99.99% SLA requirements. Lambda has three regions and capacity isn't guaranteed.
- You need managed services that are first-class on AWS. Bedrock, SageMaker model registry, EventBridge integrations, etc.
- Your org's data already lives in S3 and the egress to copy it elsewhere is more expensive than the GPU itself.
- Fleet operations at scale. If you're training 50 jobs in parallel, you want Kubernetes or SageMaker, not a launcher script.
But for experimental work — research, training a single model, running a one-off evaluation, doing the prototyping that 95% of ML actually is — Lambda + Tailscale + local-MLflow is a strictly better experience than AWS. Faster to set up, cheaper to run, easier to reason about.
The hiccups, honestly
This is not an "everything just worked" story. Things that broke and what I learned:
- Three relaunches before I got the right instance configuration. First I forgot the persistent filesystem (so the data would die on termination). Then I forgot to check that my local SSH private key matched the public key registered on Lambda (it didn't — I had an old RSA key in Lambda and only ED25519 keys locally). Then I forgot to attach a firewall ruleset. Lesson: do a single pre-flight checklist before the first `launch` call, not after each one fails.
- MLflow's FastAPI security middleware 403s any Host header outside `localhost`. This was a 15-minute red herring. The fix is `MLFLOW_SERVER_ALLOWED_HOSTS='juggernaut:*,100.x.y.z:*'`, with the `:*` port wildcard so fnmatch matches.
- macOS BSD rsync doesn't support `--append-verify`. Standard Linux rsync (3.0+) does. Just use `--partial` and don't bother with `--append-verify`.
- Public-internet port 22 is unreliable for repeated SSH calls. Use Tailscale. See above.
- Lambda's API rejected `firewall_rulesets: ["name"]`. It wants `[{"id": "..."}]`. Half an API call.
The total time lost to these hiccups was maybe two hours. The total time saved compared to a "real" AWS deployment is, conservatively, two weeks.
Bottom line
For single-GPU research workloads, the Lambda + Tailscale + laptop-MLflow stack is shockingly good. It deletes maybe 80% of the operational overhead that AWS imposes on small ML teams, at half the per-hour price, with about the same provisioning latency once your scripts are written.
It is not going to replace AWS for production fleets. It is going to replace AWS for the prototyping phase that comes before production fleets, and that phase is most of your work.
If you spend more than $200/month on AWS GPU instances and you're not in compliance-mandated territory, you owe it to yourself to spend an afternoon trying this setup. The afternoon pays for itself in the first run.