Building a Cloud IDE from Scratch: Architecting "Just Run It" with Kubernetes, WebSockets, and Real-Time Terminals
A Deep Dive into Creating a Production-Ready Cloud Development Environment
Have you ever wondered what happens behind the scenes when you click "Create Project" on platforms like Replit, CodeSandbox, or Gitpod? How do they instantly spin up isolated development environments, provide real-time code editing, and deliver a fully functional terminal—all running seamlessly in your browser?
I spent months building Just Run It, a cloud-based IDE that does exactly that. This isn't just a toy project—it's a production-grade platform that dynamically provisions Kubernetes pods, manages real-time file synchronization via WebSockets, and implements browser-based terminals using pseudo-TTYs. In this article, I'll take you through the complete architecture, share the technical decisions I made, reveal the challenges I encountered, and document the hard-won lessons along the way.
By the end of this deep dive, you'll understand:
- How to dynamically provision isolated containers for each user project
- How to implement real-time file synchronization with WebSockets
- How to create browser-based terminals with pseudo-TTY
- How to design a multi-tenant system with Kubernetes
- The scalability considerations for serving thousands of concurrent users
- The production gotchas that nobody tells you about
Let's dive in.
The Problem: Why Build a Cloud IDE?
I built Just Run It because I wanted to understand how platforms like Replit, CodeSandbox, and Gitpod actually work under the hood.
What happens when you click "Create Project"? How do they spin up isolated environments in seconds? How do they handle real-time file synchronization? How do they make terminals work in a browser?
These questions led me down a rabbit hole of infrastructure complexity that I was eager to explore:
- Kubernetes orchestration — How do you dynamically provision containers for thousands of users?
- Real-time communication — How do you sync file changes across WebSocket connections?
- Process management — How do you create a real terminal experience in a browser using PTY?
- Distributed storage — How do you ensure data persistence when containers are ephemeral?
- Dynamic networking — How do you route traffic to the right container based on subdomains?
- Multi-tenancy — How do you isolate users while sharing the same infrastructure?
Building a cloud IDE isn't just about creating a product—it's a crash course in distributed systems, container orchestration, and real-time architectures. Every component touches multiple layers of the stack, from the browser's WebSocket connection all the way down to Kubernetes API calls and container runtime.
That complexity is exactly what I wanted to dive into. Just Run It became my vehicle for understanding how modern cloud platforms are architected, one Kubernetes manifest at a time.
Architecture Overview
Just Run It consists of three core microservices orchestrating a Kubernetes cluster, with AWS S3 providing persistent storage:
┌─────────────────────────────────────────────────────────────────────────┐
│ USER BROWSER │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Landing │ │ Monaco │ │ xterm.js │ │ Output │ │
│ │ Page │ │ Editor │ │ Terminal │ │ iframe │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└───────┼─────────────┼─────────────┼─────────────┼───────────────────────┘
│ │ │ │
└─────────────┴──────┬──────┴─────────────┘
│
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ Init │ │ Orchestrator │ │ NGINX │
│ Service │ │ Service │ │ Ingress │
└────┬─────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ AWS S3 │◄─────│ Kubernetes │──────►│Runner Pod │
│(Storage) │ │ API │ │(Per Project) │
└──────────┘ └──────────────┘ └──────────────┘
Each component plays a critical role. Let me break them down.
Service 1: The Init Service — Project Bootstrapping
The Problem: When a user clicks "Create New Project," they need a starting point. Nobody wants to stare at an empty directory, and manually setting up project structures is tedious.
The Solution: The Init Service copies language-specific templates from S3, giving users a fully configured starting point.
The Flow
User selects "Node.js"
→ Init Service copies template from S3
→ Project ready in seconds
Implementation
app.post("/project", async (req, res) => {
const { projectId, language } = req.body;
// Copy template files from S3
// templates/node-js/* → projects/{projectId}/*
await copyProjectFolder(
`templates/${language}`,
`projects/${projectId}`
);
return res.send("Project created!");
});
The magic happens in the S3 helper function:
// List all files in the template folder
const listedObjects = await s3.listObjectsV2({
Bucket: "my-bucket",
Prefix: "templates/node-js"
}).promise();
// Copy each file to the new project location
for (const object of listedObjects.Contents) {
await s3.copyObject({
Bucket: "my-bucket",
CopySource: `my-bucket/${object.Key}`,
Key: object.Key.replace("templates/node-js", `projects/${projectId}`)
}).promise();
}
Why S3 Over a Database?
I chose S3 for file storage because:
- Cost-effective for large files — Pennies per GB versus expensive database storage
- No size limits — Projects can grow to gigabytes without issues
- Built-in versioning — Future feature potential without re-architecting
- Kubernetes-native integration — Init containers can pull project files straight from S3 before the app starts
Template Structure
S3 Bucket
├── templates/
│ ├── node-js/
│ │ ├── package.json
│ │ ├── index.js
│ │ └── README.md
│ ├── python/
│ │ ├── requirements.txt
│ │ └── main.py
│ └── react/
│ ├── package.json
│ ├── src/
│ └── public/
└── projects/
├── abc123/ ← User's project
└── xyz789/ ← Another user's project
This structure makes adding new languages trivial—just upload a new template folder to S3.
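As a concrete example, the UI's list of available languages can come straight from that layout. Here's a minimal sketch in the same AWS SDK v2 style as the Init Service code above; the listTemplates helper is illustrative, not part of the actual service:
import { S3 } from 'aws-sdk';

const s3 = new S3();

// Lists the immediate "folders" under templates/ so a new language shows up
// in the UI as soon as its template folder is uploaded to S3.
export const listTemplates = async (bucket: string): Promise<string[]> => {
  const listed = await s3.listObjectsV2({
    Bucket: bucket,
    Prefix: 'templates/',
    Delimiter: '/', // group keys by the next "/" => one entry per language
  }).promise();

  return (listed.CommonPrefixes ?? []).map((p) =>
    p.Prefix!.replace('templates/', '').replace(/\/$/, '')
  );
};

// listTemplates('my-bucket') => ["node-js", "python", "react"]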
Service 2: The Orchestrator — Kubernetes Wizardry
This is where the real magic happens. When a user opens their project, the Orchestrator dynamically creates Kubernetes resources to spin up an isolated development environment.
The Challenge
I needed to:
- Create a dedicated container for each project
- Pre-load project files before the application starts
- Expose two endpoints: WebSocket (IDE communication) and HTTP (app output)
- Route traffic based on subdomain (project-id.myplatform.com)
The Solution: Dynamic Kubernetes Manifests
Instead of manually creating YAML files for every project, I use a template with placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service_name # ← Placeholder
spec:
  replicas: 1
  template:
    spec:
      # Init container downloads files from S3 BEFORE main container starts
      initContainers:
        - name: copy-s3-resources
          image: amazon/aws-cli
          command: ["/bin/sh", "-c"]
          args:
            - aws s3 cp s3://my-bucket/projects/service_name/ /workspace/ --recursive
          volumeMounts:
            - name: workspace-volume
              mountPath: /workspace
      # Main container runs the development environment
      containers:
        - name: runner
          image: my-runner-image:latest
          ports:
            - containerPort: 3001 # WebSocket
            - containerPort: 3000 # HTTP
          volumeMounts:
            - name: workspace-volume
              mountPath: /workspace
          resources:
            requests:
              cpu: "1"
              memory: "1Gi"
            limits:
              cpu: "1"
              memory: "1Gi"
The Orchestrator reads this template, replaces service_name with the actual project ID, and applies it to Kubernetes:
const readAndParseKubeYaml = (filePath, projectId) => {
const fileContent = fs.readFileSync(filePath, 'utf8');
// Parse multi-document YAML (Deployment + Service + Ingress)
const docs = yaml.parseAllDocuments(fileContent).map((doc) => {
let docString = doc.toString();
// Replace placeholder with actual project ID
docString = docString.replace(/service_name/g, projectId);
return yaml.parse(docString);
});
return docs;
};
app.post("/start", async (req, res) => {
const { projectId } = req.body;
const manifests = readAndParseKubeYaml("./service.yaml", projectId);
for (const manifest of manifests) {
switch (manifest.kind) {
case "Deployment":
await k8sAppsApi.createNamespacedDeployment("default", manifest);
break;
case "Service":
await k8sCoreApi.createNamespacedService("default", manifest);
break;
case "Ingress":
await k8sNetworkingApi.createNamespacedIngress("default", manifest);
break;
}
}
res.send({ message: "Environment ready!" });
});
The Init Container Pattern
This is one of my favorite Kubernetes patterns. The init container runs before the main container and:
- Downloads project files from S3
- Places them in a shared volume (/workspace)
- Exits successfully
- Main container starts with files already in place
Pod Lifecycle:
┌─────────────────────────────────────────────────────────┐
│ 1. Init Container (aws-cli) │
│ └── aws s3 cp s3://bucket/projects/abc123/ /workspace│
│ │
│ 2. Init Container exits (success) │
│ │
│ 3. Main Container (runner) starts │
│ └── /workspace already has all project files! │
└─────────────────────────────────────────────────────────┘
This pattern is elegant, reliable, and built into Kubernetes. No custom orchestration needed.
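One wrinkle worth noting: /start returns as soon as the manifests are applied, which can be well before the init container has finished copying files. A readiness poll is one way to close that gap. This is a sketch in the same pre-1.0 @kubernetes/client-node call style as the Orchestrator code, not necessarily how Just Run It handles it:
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const appsApi = kc.makeApiClient(k8s.AppsV1Api);

// Poll the project's Deployment until its single replica is ready, i.e. the
// init container has finished and the runner container is up, or give up.
const waitForDeployment = async (projectId: string, timeoutMs = 60_000) => {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const { body } = await appsApi.readNamespacedDeployment(projectId, 'default');
    if ((body.status?.readyReplicas ?? 0) >= 1) return true;
    await new Promise((r) => setTimeout(r, 2_000)); // wait before polling again
  }
  return false; // let the frontend show "still provisioning..." instead of a dead socket
};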
Ingress: The Routing Magic
Each project gets two subdomains:
| Domain | Port | Purpose |
|---|---|---|
| abc123.justrunit.work.gd | 3001 | WebSocket for IDE communication |
| abc123.justrunit.run.place | 3000 | HTTP for viewing app output |
The Ingress configuration makes this possible:
apiVersion: networking.k8s.io/v1
kind: Ingress
spec:
  rules:
    - host: abc123.justrunit.work.gd
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: abc123
                port:
                  number: 3001 # WebSocket
    - host: abc123.justrunit.run.place
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: abc123
                port:
                  number: 3000 # HTTP
Why two domains? Security isolation. The user's running application shouldn't have access to the IDE's WebSocket connection. Separate domains provide clean separation of concerns.
Service 3: The Runner — Where Code Comes Alive
The Runner is the heart of the platform. It runs inside each project's pod and handles:
- Real-time file operations via WebSocket
- Terminal emulation with PTY
- Syncing changes back to S3
WebSocket Events
I use Socket.IO for real-time communication. Here's the event protocol (a client-side usage sketch follows the table):
| Event | Direction | Purpose |
|---|---|---|
| loaded | Server → Client | Initial file tree |
| fetchDir | Client → Server | List directory contents |
| fetchContent | Client → Server | Read file content |
| updateContent | Client → Server | Save file (+ S3 sync) |
| requestTerminal | Client → Server | Create terminal session |
| terminalData | Bidirectional | Terminal I/O |
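On the client side, the request/response events in this table map naturally onto Socket.IO acknowledgement callbacks. A small sketch of the frontend half, where the { dir } payload shape is a guess based on the table and the logging stands in for Monaco and the file tree:
import { io } from 'socket.io-client';

const socket = io('wss://abc123.justrunit.work.gd');

// The third argument is Socket.IO's acknowledgement callback; the Runner
// fulfils it with callback(content), as the server code below shows.
socket.emit('fetchContent', { path: 'index.js' }, (content: string) => {
  console.log('file content ready for Monaco:', content.length, 'bytes');
});

// Same request/response pattern for lazily expanding a folder in the tree.
socket.emit('fetchDir', { dir: 'src' }, (entries: string[]) => {
  console.log('directory listing:', entries);
});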
Implementation
io.on("connection", async (socket) => {
// Extract project ID from subdomain
// "abc123.justrunit.work.gd" → "abc123"
const host = socket.handshake.headers.host;
const projectId = host?.split('.')[0];
// Send initial file structure
socket.emit("loaded", {
rootContent: await fetchDir("/workspace", "")
});
// File operations
socket.on("fetchContent", async ({ path }, callback) => {
const content = await fs.readFile(`/workspace/${path}`, "utf8");
callback(content);
});
socket.on("updateContent", async ({ path, content }) => {
// Save locally (instant feedback)
await fs.writeFile(`/workspace/${path}`, content);
// Persist to S3 (survives pod restarts!)
await s3.putObject({
Bucket: "my-bucket",
Key: `projects/${projectId}/${path}`,
Body: content
}).promise();
});
});
The Dual-Write Strategy
Every file save triggers two writes:
- Local filesystem — Instant feedback for the user
- S3 — Durability across pod restarts
| Operation | Local Filesystem | S3 |
|---|---|---|
| Read file | ~1ms | ~50-200ms |
| Write file | ~1ms | ~100-300ms |
| List directory | ~1ms | ~50-150ms |
The local filesystem provides snappy UX, while S3 ensures data survives pod terminations.
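On the editor side, it also helps to debounce updateContent so the S3 write doesn't fire on every keystroke. A small client sketch, where the 500 ms window is an arbitrary choice rather than what Just Run It uses:
import { io } from 'socket.io-client';

const socket = io('wss://abc123.justrunit.work.gd');

let pending: ReturnType<typeof setTimeout> | undefined;

// Called by the editor on every change; only the last edit in the window is
// sent, so S3 sees one write per pause in typing instead of one per keystroke.
const saveFile = (path: string, content: string) => {
  if (pending) clearTimeout(pending);
  pending = setTimeout(() => {
    socket.emit('updateContent', { path, content });
  }, 500);
};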
The Terminal: PTY Magic
This was the trickiest part of the entire project. Browsers can't run bash directly, so I use node-pty to create pseudo-terminals.
What is a PTY?
A pseudo-terminal is a pair of virtual devices:
- Master side: Controlled by our application
- Slave side: Looks like a real terminal to programs (bash, vim, etc.)
When you run bash attached to a PTY, it behaves exactly like it would in a real terminal—supporting colors, cursor movement, job control, and more.
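A quick way to feel the difference is to run the same command through a pipe and through a PTY. This is a toy comparison (assuming node-pty is installed and a GNU ls), not part of Just Run It's code:
import { spawn as spawnPty } from 'node-pty';
import { spawn as spawnPipe } from 'child_process';

// Through a plain pipe, ls detects "not a terminal" and drops colors and columns.
spawnPipe('ls', ['--color=auto']).stdout.on('data', (chunk) =>
  console.log('pipe:', chunk.toString())
);

// Through a PTY, the same command believes it is attached to a real terminal,
// so the output comes back with ANSI color codes and terminal-width columns.
const term = spawnPty('bash', ['-c', 'ls --color=auto'], {
  name: 'xterm-256color',
  cols: 80,
  rows: 24,
});
term.onData((chunk) => console.log('pty:', chunk));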
Architecture
┌───────────┐ ┌───────────┐ ┌──────────┐
│ xterm.js │◄────────►│ Socket.IO │◄────────►│ node-pty │
│ (Browser) │ WebSocket│ (Server) │ IPC │ (PTY) │
└───────────┘ └───────────┘ └────┬─────┘
│
▼
┌───────────┐
│ bash │
│ (process) │
└───────────┘
Implementation
import { spawn, IPty } from 'node-pty';
class TerminalService {
private sessions: Map<string, IPty> = new Map();
createPty(socketId: string, onData: (data: string) => void) {
// Spawn a real bash process
const pty = spawn('bash', [], {
name: 'xterm-256color',
cols: 80,
rows: 24,
cwd: '/workspace',
env: {
...process.env,
PS1: '\\u@runner:\\w$ ' // Custom prompt
}
});
// Stream output to client
pty.onData((data) => onData(data));
this.sessions.set(socketId, pty);
return pty;
}
write(socketId: string, data: string) {
// Forward keystrokes to bash
this.sessions.get(socketId)?.write(data);
}
}
On the frontend, xterm.js renders the terminal:
// Frontend
socket.emit("requestTerminal");
socket.on("terminal", ({ data }) => {
// Render output in xterm.js
terminal.write(data);
});
terminal.onData((data) => {
// Send keystrokes to server
socket.emit("terminalData", { data });
});
The result? A fully functional bash terminal in the browser:
user@runner:/workspace$ npm install
added 150 packages in 3.2s
user@runner:/workspace$ node index.js
Server running on port 3000
Signal Handling
Real terminals support signals like Ctrl+C (SIGINT) and Ctrl+Z (SIGTSTP). These work automatically with PTY because the terminal driver handles them:
User presses Ctrl+C
↓
xterm.js sends: "\x03" (ASCII ETX)
↓
Socket.IO transmits to server
↓
node-pty writes "\x03" to PTY master
↓
Terminal driver interprets as SIGINT
↓
bash sends SIGINT to foreground process
↓
Process terminates (or handles signal)
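Window resizing travels the same path: xterm.js reports new dimensions, node-pty's resize() updates the PTY, and the kernel delivers SIGWINCH to bash so full-screen programs redraw. Here is a sketch of the two lifecycle concerns the Runner needs on top of the TerminalService above; the terminalResize event name is my own, not necessarily the platform's:
import { IPty } from 'node-pty';
import { Socket } from 'socket.io';

// Two extra concerns on top of the TerminalService shown earlier:
// resizing the PTY and cleaning up when the socket goes away.
export const wireTerminalLifecycle = (
  socket: Socket,
  sessions: Map<string, IPty>,
) => {
  // xterm.js fires a resize with new dimensions; resizing the PTY makes the
  // kernel deliver SIGWINCH to bash, so vim/htop redraw at the new size.
  socket.on('terminalResize', ({ cols, rows }: { cols: number; rows: number }) => {
    sessions.get(socket.id)?.resize(cols, rows);
  });

  // Kill the bash process when the WebSocket disconnects, otherwise
  // orphaned shells accumulate in the pod until it runs out of memory.
  socket.on('disconnect', () => {
    sessions.get(socket.id)?.kill();
    sessions.delete(socket.id);
  });
};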
The Complete Data Flow
Let me walk through what happens when a user creates and uses a project:
Phase 1: Project Creation
- User clicks "Create Node.js Project"
- Frontend → POST /project { projectId: "abc123", language: "node-js" }
- Init Service copies S3: templates/node-js/* → projects/abc123/*
- Frontend navigates to /coding?projectId=abc123
Phase 2: Environment Provisioning
- Frontend → POST /start { projectId: "abc123" }
- Orchestrator creates Kubernetes resources:
  - Deployment (with init container + runner)
  - Service (internal networking)
  - Ingress (domain routing)
- Kubernetes schedules the pod on a node
- Init container runs: aws s3 cp → /workspace/
- Runner container starts
Phase 3: Real-Time Coding
- Frontend connects: ws://abc123.justrunit.work.gd
- Runner sends the file tree via the loaded event
- User clicks a file → fetchContent → Monaco Editor displays it
- User edits → updateContent → local save + S3 sync
- User opens the terminal → requestTerminal → PTY spawned
- User types "npm start" → terminalData → bash executes it
- App runs on port 3000 → visible at abc123.justrunit.run.place
Scalability: How Many Users Can This Handle?
This is the million-dollar question. Let's break it down.
Resource Requirements Per Project
Each project pod requests:
- 1 CPU core
- 1 GB RAM
Cluster Capacity
| Cluster Size | Node Specs | Concurrent Projects | Use Case |
|---|---|---|---|
| Small | 3 nodes × (4 CPU, 16GB) | ~30-40 | Development/Testing |
| Medium | 10 nodes × (8 CPU, 32GB) | ~150-200 | Small startup |
| Large | 50 nodes × (16 CPU, 64GB) | ~1,000+ | Growing platform |
| Enterprise | 200+ nodes | ~5,000+ | Full scale |
(These figures track memory at 1 GB per project; with a hard 1 vCPU request per pod, CPU becomes the binding constraint first, which is where the right-sizing tiers mentioned below come in.)
Bottlenecks & Solutions
| Bottleneck | Impact | Solution |
|---|---|---|
| Ingress Controller | Single entry point | Deploy multiple replicas, use cloud LB |
| Orchestrator Service | K8s API calls are slow | Add caching, queue requests |
| S3 Rate Limits | 3,500 PUT/s per prefix | Shard by project ID prefix |
| Pod Startup Time | 10-30 seconds | Pre-warm pool of pods |
Cost Optimization
At scale, costs matter. Here's what I'd implement:
- Idle Detection — Terminate pods after 30 minutes of inactivity (sketched after this list)
- Spot Instances — Use preemptible nodes for 60-80% cost savings
- Right-sizing — Offer different tiers (0.5 CPU for small projects)
- Cold Storage — Archive inactive projects to S3 Glacier
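The idle-detection item above could live inside the Runner itself. In this sketch the Orchestrator's /stop endpoint is an assumption rather than an existing piece of the platform, and the global fetch requires Node 18+:
import type { Server } from 'socket.io';

export const enableIdleShutdown = (io: Server, projectId: string) => {
  const IDLE_LIMIT_MS = 30 * 60 * 1000; // 30 minutes
  let lastActivity = Date.now();

  // Any WebSocket traffic (edits, terminal keystrokes, ...) counts as activity.
  io.on('connection', (socket) => {
    socket.onAny(() => { lastActivity = Date.now(); });
  });

  setInterval(async () => {
    if (Date.now() - lastActivity > IDLE_LIMIT_MS) {
      // Hypothetical Orchestrator endpoint that deletes this project's
      // Deployment, Service, and Ingress (files are already safe in S3).
      await fetch('http://orchestrator.default.svc/stop', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ projectId }),
      });
    }
  }, 60_000);
};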
Networking Deep Dive
One of the most complex aspects is networking. Each project needs its own subdomain, and we need to handle both WebSocket and HTTP traffic differently.
Wildcard DNS: The Foundation
Instead of creating a DNS record for every project, I use wildcard DNS:
*.justrunit.work.gd → Load Balancer IP
*.justrunit.run.place → Load Balancer IP
This means abc123.justrunit.work.gd, xyz789.justrunit.work.gd, and any other subdomain all resolve to the same IP. The routing to the correct pod happens at the Ingress layer.
NGINX Ingress Controller: Traffic Cop
The NGINX Ingress Controller inspects the Host header to determine which pod to route to:
Request: GET / HTTP/1.1
Host: abc123.justrunit.work.gd
Connection: Upgrade
Upgrade: websocket
┌─────────────────────────────────────────────────────────┐
│ NGINX Ingress Controller │
├─────────────────────────────────────────────────────────┤
│ 1. TLS Termination (decrypt HTTPS) │
│ 2. Parse Host header: "abc123.justrunit.work.gd" │
│ 3. Look up Ingress rules for this host │
│ 4. Find: route to Service "abc123" port 3001 │
│ 5. Detect WebSocket upgrade, maintain connection │
│ 6. Forward to pod IP (from Service endpoints) │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────┐
│ Pod: abc123 │
│ Port: 3001 │
└─────────────────┘
TLS Certificates at Scale
Managing SSL certificates for thousands of subdomains sounds nightmarish, but wildcard certificates make it simple:
spec:
  tls:
    - hosts:
        - "*.justrunit.work.gd"
      secretName: wildcard-work-gd-tls
    - hosts:
        - "*.justrunit.run.place"
      secretName: wildcard-run-place-tls
I use cert-manager with Let's Encrypt to automatically provision and renew these certificates.
Production Considerations
Building the core functionality is one thing. Running it in production is another.
Monitoring & Observability
A distributed system needs comprehensive monitoring. Key metrics I track (with an exporter sketch after the list):
# Resource usage
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- nginx_ingress_controller_requests_total
# Application metrics
- socket_io_connected_clients
- terminal_sessions_active
- s3_operations_total
- pod_startup_duration_seconds
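For the application-level metrics, here's a minimal exporter sketch with prom-client. The metric names mirror the list above; wiring the inc()/dec() calls into the Socket.IO and PTY handlers is the obvious next step:
import express from 'express';
import { collectDefaultMetrics, Gauge, register } from 'prom-client';

const app = express();
collectDefaultMetrics(); // CPU, memory, event-loop lag, GC, ...

const connectedClients = new Gauge({
  name: 'socket_io_connected_clients',
  help: 'Currently connected WebSocket clients',
});
const terminalSessions = new Gauge({
  name: 'terminal_sessions_active',
  help: 'Active PTY sessions',
});

// Prometheus scrapes this endpoint on each pod.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

app.listen(9464); // any dedicated metrics port works

// Elsewhere: connectedClients.inc() on connection, .dec() on disconnect,
// terminalSessions.inc() when a PTY is spawned, .dec() when it exits.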
Error Handling
Every external call needs robust error handling:
socket.on("updateContent", async ({ path, content }) => {
try {
await fs.writeFile(`/workspace/${path}`, content);
try {
await s3.putObject({...}).promise();
} catch (s3Error) {
// S3 failure shouldn't break UX
logger.error('s3_sync_failed', { path, error: s3Error.message });
// Queue for retry
retryQueue.add({ path, content, projectId });
}
} catch (fsError) {
socket.emit('error', { message: 'Failed to save file' });
logger.error('file_save_failed', { path, error: fsError.message });
}
});
Graceful Shutdown
When a pod is terminated, clean up gracefully:
process.on('SIGTERM', async () => {
logger.info('shutdown_initiated', {});
// Stop accepting new connections
io.close();
// Give existing operations time to complete
await new Promise(resolve => setTimeout(resolve, 5000));
// Close all terminal sessions
terminalService.closeAll();
// Flush any pending S3 writes
await retryQueue.flush();
process.exit(0);
});
Security Hardening
Security is non-negotiable for a platform that runs arbitrary user code.
Container Isolation:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true # Except /workspace
Network Policies (prevent pods from communicating with each other):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8 # Block internal network
              - 172.16.0.0/12
              - 192.168.0.0/16
Resource Limits (prevent resource exhaustion):
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
Lessons Learned
1. Init Containers Are Underrated
The init container pattern solved my biggest challenge: how to pre-populate the filesystem before the app starts. It's elegant, reliable, and built into Kubernetes. No custom orchestration needed.
2. WebSockets Need Careful Error Handling
Connections drop. Networks fail. I learned to implement the following (client-side sketch after the list):
- Automatic reconnection with exponential backoff
- Message queuing during disconnects
- Heartbeat pings to detect dead connections
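Here's the client-side sketch promised above: Socket.IO's built-in reconnection (which already backs off exponentially) plus a tiny outbound queue. Heartbeats come from Socket.IO's own ping/pong, tuned with pingInterval and pingTimeout on the server. The numbers are illustrative:
import { io, Socket } from 'socket.io-client';

const socket: Socket = io('wss://abc123.justrunit.work.gd', {
  reconnection: true,
  reconnectionDelay: 1_000,     // first retry after ~1s
  reconnectionDelayMax: 30_000, // backoff caps at 30s
});

// Queue saves while offline and flush them once the connection is back.
const queue: Array<{ path: string; content: string }> = [];

export const save = (path: string, content: string) => {
  if (socket.connected) socket.emit('updateContent', { path, content });
  else queue.push({ path, content });
};

socket.on('connect', () => {
  while (queue.length) socket.emit('updateContent', queue.shift()!);
});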
3. PTY Is Not Just "Running Commands"
A real terminal needs:
- Proper signal handling (Ctrl+C, Ctrl+Z)
- Window resize events
- ANSI escape code support
- Session persistence
4. Multi-Tenancy Is Hard
Isolating users requires thinking about:
- Resource limits (CPU, memory, disk)
- Network policies (prevent cross-pod communication)
- Filesystem isolation (each pod has its own /workspace)
- Process isolation (containerization handles this)
5. Persistence Strategy Matters
I chose S3 because:
- Pods are ephemeral—they can be killed anytime
- S3 provides durability (11 9's)
- Init containers make S3 → Pod sync seamless
- Real-time sync keeps S3 updated
What I'd Do Differently
If I were starting over:
- Use a Message Queue — Decouple the Orchestrator from synchronous K8s API calls. RabbitMQ or Redis Streams would make the system more resilient (see the sketch below).
- Implement Pod Pooling — Pre-create a pool of warm pods to reduce startup latency from 30 seconds to <2 seconds.
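For the message-queue idea, this is roughly what the decoupling could look like with BullMQ on Redis. The queue name, retry settings, and the applyManifest helper are mine, not part of the current system; readAndParseKubeYaml is the function shown earlier:
import { Queue, Worker } from 'bullmq';

// From the Orchestrator code above / a hypothetical wrapper around the switch on manifest.kind.
declare function readAndParseKubeYaml(path: string, projectId: string): any[];
declare function applyManifest(manifest: any): Promise<void>;

const connection = { host: 'redis', port: 6379 };

// API side: enqueue the request instead of calling the Kubernetes API inline.
const provisioning = new Queue('provisioning', { connection });

export const requestEnvironment = (projectId: string) =>
  provisioning.add('start', { projectId }, {
    attempts: 3,
    backoff: { type: 'exponential', delay: 2_000 }, // retry transient API errors
  });

// Worker side: applies manifests at its own pace, isolated from HTTP latency.
new Worker('provisioning', async (job) => {
  const { projectId } = job.data as { projectId: string };
  const manifests = readAndParseKubeYaml('./service.yaml', projectId);
  for (const manifest of manifests) {
    await applyManifest(manifest);
  }
}, { connection });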
Cost Analysis
Let's talk money. Running a cloud IDE isn't cheap.
Per-Project Costs (AWS, us-east-1)
| Resource | Specification | Monthly Cost |
|---|---|---|
| EC2 (pod) | 1 vCPU, 1GB RAM | ~$7.50 |
| S3 Storage | 100MB project | ~$0.0023 |
| Data Transfer | ~1GB/month | ~$0.09 |
Total per active project: ~$7.50/month
Platform Costs (Fixed)
| Resource | Specification | Monthly Cost |
|---|---|---|
| EKS Control Plane | Managed Kubernetes | $72 |
| Load Balancer | Network LB | $16 |
| NAT Gateway | Outbound traffic | $32 |
| Init/Orchestrator nodes | 2× t3.medium | $60 |
Fixed monthly cost: ~$180
Break-Even Analysis
Fixed costs: $180/month
Per-project cost: $7.50/month
At $10/user/month pricing:
Break-even = 180 / (10 - 7.50) = 72 users
At $15/user/month pricing:
Break-even = 180 / (15 - 7.50) = 24 users
Conclusion
Building Just Run It has been an incredible learning journey. What started as curiosity about "how does Replit work?" turned into a deep dive through:
- Kubernetes orchestration and dynamic resource management
- Real-time systems with WebSockets and event-driven architecture
- Process management with pseudo-terminals
- Distributed storage patterns with S3
- Multi-tenant security and isolation
Tech Stack Summary
Frontend:
- React
- Monaco Editor (VS Code editor)
- xterm.js (terminal emulation)
- Socket.IO Client
Backend:
- Node.js, Express, TypeScript
- Socket.IO (real-time communication)
- node-pty (pseudo-terminal)
Infrastructure:
- Kubernetes (container orchestration)
- NGINX Ingress Controller
- Docker
- AWS S3 (persistent storage)
- @kubernetes/client-node (K8s API client)