Tags: Backend, Kubernetes, WebSockets, Cloud IDE, DevOps, Node.js, AWS, Docker, Real-time Systems

Building a Cloud IDE from Scratch: Architecting 'Just Run It' with Kubernetes, WebSockets, and Real-Time Terminals

A deep dive into creating a production-grade cloud development environment that dynamically provisions isolated coding workspaces on demand.

Published: December 5, 2025
18 min read

Have you ever wondered what happens behind the scenes when you click "Create Project" on platforms like Replit, CodeSandbox, or Gitpod? How do they instantly spin up isolated development environments, provide real-time code editing, and deliver a fully functional terminal—all running seamlessly in your browser?

I spent months building Just Run It, a cloud-based IDE that does exactly that. This wasn't just a toy project—it's a production-grade platform that dynamically provisions Kubernetes pods, manages real-time file synchronization via WebSockets, and implements browser-based terminals using pseudo-TTY. In this article, I'll take you through the complete architecture, share the technical decisions I made, reveal the challenges I encountered, and document the hard-won lessons learned.

By the end of this deep dive, you'll understand:

  • How to dynamically provision isolated containers for each user project
  • How to implement real-time file synchronization with WebSockets
  • How to create browser-based terminals with pseudo-TTY
  • How to design a multi-tenant system with Kubernetes
  • The scalability considerations for serving thousands of concurrent users
  • The production gotchas that nobody tells you about

Let's dive in.

The Problem: Why Build a Cloud IDE?

I built Just Run It because I wanted to understand how platforms like Replit, CodeSandbox, and Gitpod actually work under the hood.

What happens when you click "Create Project"? How do they spin up isolated environments in seconds? How do they handle real-time file synchronization? How do they make terminals work in a browser?

These questions led me down a rabbit hole of infrastructure complexity that I was eager to explore:

  • Kubernetes orchestration — How do you dynamically provision containers for thousands of users?
  • Real-time communication — How do you sync file changes across WebSocket connections?
  • Process management — How do you create a real terminal experience in a browser using PTY?
  • Distributed storage — How do you ensure data persistence when containers are ephemeral?
  • Dynamic networking — How do you route traffic to the right container based on subdomains?
  • Multi-tenancy — How do you isolate users while sharing the same infrastructure?

Building a cloud IDE isn't just about creating a product—it's a crash course in distributed systems, container orchestration, and real-time architectures. Every component touches multiple layers of the stack, from the browser's WebSocket connection all the way down to Kubernetes API calls and the container runtime.

That complexity is exactly what I wanted to dive into. Just Run It became my vehicle for understanding how modern cloud platforms are architected, one Kubernetes manifest at a time.

Architecture Overview

Just Run It consists of three core microservices orchestrating a Kubernetes cluster, with AWS S3 providing persistent storage:

┌─────────────────────────────────────────────────────────────────────────┐
│                              USER BROWSER                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ Landing  │  │  Monaco  │  │ xterm.js │  │  Output  │               │
│  │   Page   │  │  Editor  │  │ Terminal │  │  iframe  │               │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘               │
└───────┼─────────────┼─────────────┼─────────────┼───────────────────────┘
        │             │             │             │
        └─────────────┴──────┬──────┴─────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        ▼                    ▼                    ▼
  ┌──────────┐      ┌──────────────┐      ┌──────────────┐
  │   Init   │      │ Orchestrator │      │    NGINX     │
  │ Service  │      │   Service    │      │   Ingress    │
  └────┬─────┘      └──────┬───────┘      └──────┬───────┘
       │                   │                     │
       ▼                   ▼                     ▼
  ┌──────────┐      ┌──────────────┐      ┌──────────────┐
  │  AWS S3  │◄─────│  Kubernetes  │─────►│Runner Pod    │
  │(Storage) │      │     API      │      │(Per Project) │
  └──────────┘      └──────────────┘      └──────────────┘

Each component plays a critical role. Let me break them down.

Service 1: The Init Service — Project Bootstrapping

The Problem: When a user clicks "Create New Project," they need a starting point. Nobody wants to stare at an empty directory, and manually setting up project structures is tedious.

The Solution: The Init Service copies language-specific templates from S3, giving users a fully configured starting point.

The Flow

User selects "Node.js" 
  → Init Service copies template from S3 
  → Project ready in seconds

Implementation

app.post("/project", async (req, res) => {
  const { projectId, language } = req.body;
  
  // Copy template files from S3
  // templates/node-js/* → projects/{projectId}/*
  await copyProjectFolder(
    `templates/${language}`,
    `projects/${projectId}`
  );
  
  return res.send("Project created!");
});

The magic happens in the S3 helper function:

import AWS from "aws-sdk";

const s3 = new AWS.S3();

// List all files in the template folder
const listedObjects = await s3.listObjectsV2({
  Bucket: "my-bucket",
  Prefix: "templates/node-js"
}).promise();
 
// Copy each file to the new project location
for (const object of listedObjects.Contents) {
  await s3.copyObject({
    Bucket: "my-bucket",
    CopySource: `my-bucket/${object.Key}`,
    Key: object.Key.replace("templates/node-js", `projects/${projectId}`)
  }).promise();
}
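
One caveat worth calling out: listObjectsV2 returns at most 1,000 keys per call, so larger templates need pagination. Here is a minimal sketch of how copyProjectFolder might handle that, assuming the same AWS SDK v2 client shown above:

// Sketch: copyProjectFolder with pagination. listObjectsV2 caps each
// response at 1,000 keys, so we loop on NextContinuationToken.
async function copyProjectFolder(sourcePrefix, destPrefix) {
  let continuationToken;

  do {
    const page = await s3.listObjectsV2({
      Bucket: "my-bucket",
      Prefix: sourcePrefix,
      ContinuationToken: continuationToken
    }).promise();

    for (const object of page.Contents ?? []) {
      await s3.copyObject({
        Bucket: "my-bucket",
        CopySource: `my-bucket/${object.Key}`,
        Key: object.Key.replace(sourcePrefix, destPrefix)
      }).promise();
    }

    continuationToken = page.NextContinuationToken;
  } while (continuationToken);
}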

Why S3 Over a Database?

I chose S3 for file storage because:

  • Cost-effective for large files — Pennies per GB versus expensive database storage
  • No size limits — Projects can grow to gigabytes without issues
  • Built-in versioning — Future feature potential without re-architecting
  • Kubernetes native integration — Init containers can pull project files straight from S3 with the AWS CLI before the app starts

Template Structure

S3 Bucket
├── templates/
│   ├── node-js/
│   │   ├── package.json
│   │   ├── index.js
│   │   └── README.md
│   ├── python/
│   │   ├── requirements.txt
│   │   └── main.py
│   └── react/
│       ├── package.json
│       ├── src/
│       └── public/
└── projects/
    ├── abc123/ ← User's project
    └── xyz789/ ← Another user's project

This structure makes adding new languages trivial—just upload a new template folder to S3.

Service 2: The Orchestrator — Kubernetes Wizardry

This is where the real magic happens. When a user opens their project, the Orchestrator dynamically creates Kubernetes resources to spin up an isolated development environment.

The Challenge

I needed to:

  1. Create a dedicated container for each project
  2. Pre-load project files before the application starts
  3. Expose two endpoints: WebSocket (IDE communication) and HTTP (app output)
  4. Route traffic based on subdomain (project-id.myplatform.com)

The Solution: Dynamic Kubernetes Manifests

Instead of manually creating YAML files for every project, I use a template with placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service_name  # ← Placeholder
spec:
  replicas: 1
  template:
    spec:
      # Init container downloads files from S3 BEFORE main container starts
      initContainers:
        - name: copy-s3-resources
          image: amazon/aws-cli
          command: ["/bin/sh", "-c"]
          args:
            - aws s3 cp s3://my-bucket/projects/service_name/ /workspace/ --recursive
          volumeMounts:
            - name: workspace-volume
              mountPath: /workspace
      
      # Main container runs the development environment
      containers:
        - name: runner
          image: my-runner-image:latest
          ports:
            - containerPort: 3001  # WebSocket
            - containerPort: 3000  # HTTP
          volumeMounts:
            - name: workspace-volume
              mountPath: /workspace
          resources:
            requests:
              cpu: "1"
              memory: "1Gi"
            limits:
              cpu: "1"
              memory: "1Gi"

The Orchestrator reads this template, replaces service_name with the actual project ID, and applies it to Kubernetes:

import fs from "fs";
import yaml from "yaml";
import * as k8s from "@kubernetes/client-node";

// Kubernetes API clients used by the /start handler below
const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const k8sAppsApi = kc.makeApiClient(k8s.AppsV1Api);
const k8sCoreApi = kc.makeApiClient(k8s.CoreV1Api);
const k8sNetworkingApi = kc.makeApiClient(k8s.NetworkingV1Api);

const readAndParseKubeYaml = (filePath, projectId) => {
  const fileContent = fs.readFileSync(filePath, 'utf8');
  
  // Parse multi-document YAML (Deployment + Service + Ingress)
  const docs = yaml.parseAllDocuments(fileContent).map((doc) => {
    let docString = doc.toString();
    // Replace placeholder with actual project ID
    docString = docString.replace(/service_name/g, projectId);
    return yaml.parse(docString);
  });
  
  return docs;
};
 
app.post("/start", async (req, res) => {
  const { projectId } = req.body;
  const manifests = readAndParseKubeYaml("./service.yaml", projectId);
  
  for (const manifest of manifests) {
    switch (manifest.kind) {
      case "Deployment":
        await k8sAppsApi.createNamespacedDeployment("default", manifest);
        break;
      case "Service":
        await k8sCoreApi.createNamespacedService("default", manifest);
        break;
      case "Ingress":
        await k8sNetworkingApi.createNamespacedIngress("default", manifest);
        break;
    }
  }
  
  res.send({ message: "Environment ready!" });
});
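
Teardown is the mirror image. The Orchestrator above only creates resources; a hypothetical /stop endpoint, reusing the same client objects, might look like this:

// Hypothetical /stop endpoint: deletes the per-project resources
// created by /start, in reverse order of creation.
app.post("/stop", async (req, res) => {
  const { projectId } = req.body;

  await k8sNetworkingApi.deleteNamespacedIngress(projectId, "default");
  await k8sCoreApi.deleteNamespacedService(projectId, "default");
  await k8sAppsApi.deleteNamespacedDeployment(projectId, "default");

  res.send({ message: "Environment stopped" });
});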

The Init Container Pattern

This is one of my favorite Kubernetes patterns. The init container runs before the main container and:

  1. Downloads project files from S3
  2. Places them in a shared volume (/workspace)
  3. Exits successfully
  4. Main container starts with files already in place

Pod Lifecycle:

┌─────────────────────────────────────────────────────────┐
│ 1. Init Container (aws-cli)                            │
│    └── aws s3 cp s3://bucket/projects/abc123/ /workspace│
│                                                         │
│ 2. Init Container exits (success)                      │
│                                                         │
│ 3. Main Container (runner) starts                      │
│    └── /workspace already has all project files!       │
└─────────────────────────────────────────────────────────┘

This pattern is elegant, reliable, and built into Kubernetes. No custom orchestration needed.

Ingress: The Routing Magic

Each project gets two subdomains:

| Domain | Port | Purpose |
|--------|------|---------|
| abc123.justrunit.work.gd | 3001 | WebSocket for IDE communication |
| abc123.justrunit.run.place | 3000 | HTTP for viewing app output |

The Ingress configuration makes this possible:

apiVersion: networking.k8s.io/v1
kind: Ingress
spec:
  rules:
    - host: abc123.justrunit.work.gd
      http:
        paths:
          - path: /
            backend:
              service:
                name: abc123
                port:
                  number: 3001  # WebSocket
    
    - host: abc123.justrunit.run.place
      http:
        paths:
          - path: /
            backend:
              service:
                name: abc123
                port:
                  number: 3000  # HTTP

Why two domains? Security isolation. The user's running application shouldn't have access to the IDE's WebSocket connection. Separate domains provide clean separation of concerns.

Service 3: The Runner — Where Code Comes Alive

The Runner is the heart of the platform. It runs inside each project's pod and handles:

  • Real-time file operations via WebSocket
  • Terminal emulation with PTY
  • Syncing changes back to S3

WebSocket Events

I use Socket.IO for real-time communication. Here's the event protocol:

| Event | Direction | Purpose |
|-------|-----------|---------|
| loaded | Server → Client | Initial file tree |
| fetchDir | Client → Server | List directory contents |
| fetchContent | Client → Server | Read file content |
| updateContent | Client → Server | Save file (+ S3 sync) |
| requestTerminal | Client → Server | Create terminal session |
| terminalData | Bidirectional | Terminal I/O |

Implementation

io.on("connection", async (socket) => {
  // Extract project ID from subdomain
  // "abc123.justrunit.work.gd" → "abc123"
  const host = socket.handshake.headers.host;
  const projectId = host?.split('.')[0];
  
  // Send initial file structure
  socket.emit("loaded", {
    rootContent: await fetchDir("/workspace", "")
  });
  
  // File operations
  socket.on("fetchContent", async ({ path }, callback) => {
    const content = await fs.readFile(`/workspace/${path}`, "utf8");
    callback(content);
  });
  
  socket.on("updateContent", async ({ path, content }) => {
    // Save locally (instant feedback)
    await fs.writeFile(`/workspace/${path}`, content);
    
    // Persist to S3 (survives pod restarts!)
    await s3.putObject({
      Bucket: "my-bucket",
      Key: `projects/${projectId}/${path}`,
      Body: content
    }).promise();
  });
});
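
The fetchDir helper used in the loaded event isn't shown here. A plausible sketch with fs/promises (readdir with withFileTypes) would be:

import { readdir } from "fs/promises";
import path from "path";

// Sketch of fetchDir: list a directory and tag each entry as a file
// or directory so the client can render the file tree.
async function fetchDir(baseDir, relativePath) {
  const entries = await readdir(path.join(baseDir, relativePath), {
    withFileTypes: true
  });

  return entries.map((entry) => ({
    name: entry.name,
    path: path.join(relativePath, entry.name),
    type: entry.isDirectory() ? "dir" : "file"
  }));
}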

The Dual-Write Strategy

Every file save triggers two writes:

  1. Local filesystem — Instant feedback for the user
  2. S3 — Durability across pod restarts

| Operation | Local Filesystem | S3 |
|-----------|------------------|----|
| Read file | ~1ms | ~50-200ms |
| Write file | ~1ms | ~100-300ms |
| List directory | ~1ms | ~50-150ms |

The local filesystem provides snappy UX, while S3 ensures data survives pod terminations.
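
One refinement worth considering, and not something the handler above does: if the editor fires updateContent on every keystroke, each save costs an S3 PUT. Debouncing the S3 side per file keeps the local write instant while collapsing bursts of saves into one upload. A sketch:

// Sketch: debounce the S3 sync per file so a burst of saves costs one
// PUT instead of one per keystroke. Local writes still happen instantly.
const pendingSyncs = new Map(); // file path → timeout handle

function scheduleS3Sync(projectId, filePath, content, delayMs = 2000) {
  clearTimeout(pendingSyncs.get(filePath));

  pendingSyncs.set(filePath, setTimeout(async () => {
    pendingSyncs.delete(filePath);
    await s3.putObject({
      Bucket: "my-bucket",
      Key: `projects/${projectId}/${filePath}`,
      Body: content
    }).promise();
  }, delayMs));
}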

The Terminal: PTY Magic

This was the trickiest part of the entire project. Browsers can't run bash directly, so I use node-pty to create pseudo-terminals.

What is a PTY?

A pseudo-terminal is a pair of virtual devices:

  • Master side: Controlled by our application
  • Slave side: Looks like a real terminal to programs (bash, vim, etc.)

When you run bash attached to a PTY, it behaves exactly like it would in a real terminal—supporting colors, cursor movement, job control, and more.

Architecture

┌───────────┐         ┌───────────┐         ┌──────────┐
│ xterm.js  │◄───────►│ Socket.IO │◄───────►│ node-pty │
│ (Browser) │ WebSocket│ (Server)  │   IPC    │  (PTY)   │
└───────────┘         └───────────┘         └────┬─────┘
                                                  │
                                                  ▼
                                            ┌───────────┐
                                            │   bash    │
                                            │ (process) │
                                            └───────────┘

Implementation

import { spawn, IPty } from 'node-pty';
 
class TerminalService {
  private sessions: Map<string, IPty> = new Map();
  
  createPty(socketId: string, onData: (data: string) => void) {
    // Spawn a real bash process
    const pty = spawn('bash', [], {
      name: 'xterm-256color',
      cols: 80,
      rows: 24,
      cwd: '/workspace',
      env: {
        ...process.env,
        PS1: '\\u@runner:\\w$ '  // Custom prompt
      }
    });
    
    // Stream output to client
    pty.onData((data) => onData(data));
    
    this.sessions.set(socketId, pty);
    return pty;
  }
  
  write(socketId: string, data: string) {
    // Forward keystrokes to bash
    this.sessions.get(socketId)?.write(data);
  }
}

On the frontend, xterm.js renders the terminal:

// Frontend
socket.emit("requestTerminal");
 
socket.on("terminal", ({ data }) => {
  // Render output in xterm.js
  terminal.write(data);
});
 
terminal.onData((data) => {
  // Send keystrokes to server
  socket.emit("terminalData", { data });
});
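
Resizing needs explicit wiring too (it resurfaces in the lessons below): xterm.js reports new dimensions, and node-pty's resize() passes them to the terminal driver so full-screen programs like vim redraw correctly. A sketch, assuming a resize event name and a resize(socketId, cols, rows) method on TerminalService, neither of which is part of the protocol above:

// Frontend: forward xterm.js dimension changes to the server
terminal.onResize(({ cols, rows }) => {
  socket.emit("resize", { cols, rows });
});

// Server: propagate to the PTY; resize(cols, rows) is node-pty's API.
// terminalService.resize is a hypothetical wrapper around it.
socket.on("resize", ({ cols, rows }) => {
  terminalService.resize(socket.id, cols, rows);
});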

The result? A fully functional bash terminal in the browser:

user@runner:/workspace$ npm install
added 150 packages in 3.2s
 
user@runner:/workspace$ node index.js
Server running on port 3000

Signal Handling

Real terminals support signals like Ctrl+C (SIGINT) and Ctrl+Z (SIGTSTP). These work automatically with PTY because the terminal driver handles them:

User presses Ctrl+C
      ↓
xterm.js sends: "\x03" (ASCII ETX)
      ↓
Socket.IO transmits to server
      ↓
node-pty writes "\x03" to PTY master
      ↓
Terminal driver interprets as SIGINT
      ↓
bash sends SIGINT to foreground process
      ↓
Process terminates (or handles signal)

The Complete Data Flow

Let me walk through what happens when a user creates and uses a project:

Phase 1: Project Creation

  1. User clicks "Create Node.js Project"
  2. Frontend → POST /project { projectId: "abc123", language: "node-js" }
  3. Init Service copies S3: templates/node-js/* → projects/abc123/*
  4. Frontend navigates to /coding?projectId=abc123

Phase 2: Environment Provisioning

  1. Frontend → POST /start { projectId: "abc123" }
  2. Orchestrator creates Kubernetes resources:
    • Deployment (with init container + runner)
    • Service (internal networking)
    • Ingress (domain routing)
  3. Kubernetes schedules pod on a node
  4. Init container runs: aws s3 cp s3://my-bucket/projects/abc123/ /workspace/ --recursive
  5. Runner container starts

Phase 3: Real-Time Coding

  1. Frontend connects: ws://abc123.justrunit.work.gd
  2. Runner sends file tree via loaded event
  3. User clicks file → fetchContent → Monaco Editor displays it
  4. User edits → updateContent → Local save + S3 sync
  5. User opens terminal → requestTerminal → PTY spawned
  6. User types "npm start" → terminalData → bash executes
  7. App runs on port 3000 → visible at abc123.justrunit.run.place

Scalability: How Many Users Can This Handle?

This is the million-dollar question. Let's break it down.

Resource Requirements Per Project

Each project pod requests:

  • 1 CPU core
  • 1 GB RAM

Cluster Capacity

| Cluster Size | Node Specs | Concurrent Projects | Use Case |
|--------------|------------|---------------------|----------|
| Small | 3 nodes × (4 CPU, 16GB) | ~30-40 | Development/Testing |
| Medium | 10 nodes × (8 CPU, 32GB) | ~150-200 | Small startup |
| Large | 50 nodes × (16 CPU, 64GB) | ~1,000+ | Growing platform |
| Enterprise | 200+ nodes | ~5,000+ | Full scale |

Bottlenecks & Solutions

| Bottleneck | Impact | Solution |
|------------|--------|----------|
| Ingress Controller | Single entry point | Deploy multiple replicas, use cloud LB |
| Orchestrator Service | K8s API calls are slow | Add caching, queue requests |
| S3 Rate Limits | 3,500 PUT/s per prefix | Shard by project ID prefix |
| Pod Startup Time | 10-30 seconds | Pre-warm pool of pods |

Cost Optimization

At scale, costs matter. Here's what I'd implement:

  • Idle Detection — Terminate pods after 30 minutes of inactivity (sketched after this list)
  • Spot Instances — Use preemptible nodes for 60-80% cost savings
  • Right-sizing — Offer different tiers (0.5 CPU for small projects)
  • Cold Storage — Archive inactive projects to S3 Glacier
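
Idle detection is the highest-leverage item, so here is a minimal sketch: stamp a timestamp on every WebSocket event and let a periodic sweeper ask the Orchestrator to tear down stale projects. The /stop endpoint and service URL are assumptions:

// Sketch: idle detection. Call touch(projectId) on every WebSocket
// event; a sweeper tears down projects idle for more than 30 minutes.
const lastActivity = new Map(); // projectId → epoch millis
const IDLE_LIMIT_MS = 30 * 60 * 1000;

function touch(projectId) {
  lastActivity.set(projectId, Date.now());
}

setInterval(async () => {
  for (const [projectId, seenAt] of lastActivity) {
    if (Date.now() - seenAt > IDLE_LIMIT_MS) {
      lastActivity.delete(projectId);
      // Orchestrator URL and /stop endpoint are assumptions
      await fetch("http://orchestrator-service/stop", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ projectId })
      });
    }
  }
}, 60_000);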

Networking Deep Dive

One of the most complex aspects is networking. Each project needs its own subdomain, and we need to handle both WebSocket and HTTP traffic differently.

Wildcard DNS: The Foundation

Instead of creating a DNS record for every project, I use wildcard DNS:

*.justrunit.work.gd → Load Balancer IP
*.justrunit.run.place → Load Balancer IP

This means abc123.justrunit.work.gd, xyz789.justrunit.work.gd, and any other subdomain all resolve to the same IP. The routing to the correct pod happens at the Ingress layer.

NGINX Ingress Controller: Traffic Cop

The NGINX Ingress Controller inspects the Host header to determine which pod to route to:

Request: GET / HTTP/1.1
Host: abc123.justrunit.work.gd
Connection: Upgrade
Upgrade: websocket

┌─────────────────────────────────────────────────────────┐
│ NGINX Ingress Controller                                │
├─────────────────────────────────────────────────────────┤
│ 1. TLS Termination (decrypt HTTPS)                      │
│ 2. Parse Host header: "abc123.justrunit.work.gd"        │
│ 3. Look up Ingress rules for this host                  │
│ 4. Find: route to Service "abc123" port 3001            │
│ 5. Detect WebSocket upgrade, maintain connection        │
│ 6. Forward to pod IP (from Service endpoints)           │
└─────────────────────────────────────────────────────────┘
                         ↓
              ┌─────────────────┐
              │  Pod: abc123    │
              │  Port: 3001     │
              └─────────────────┘

TLS Certificates at Scale

Managing SSL certificates for thousands of subdomains sounds nightmarish, but wildcard certificates make it simple:

spec:
  tls:
    - hosts:
        - "*.justrunit.work.gd"
      secretName: wildcard-work-gd-tls
    - hosts:
        - "*.justrunit.run.place"
      secretName: wildcard-run-place-tls

I use cert-manager with Let's Encrypt to automatically provision and renew these certificates.

Production Considerations

Building the core functionality is one thing. Running it in production is another.

Monitoring & Observability

A distributed system needs comprehensive monitoring. Key metrics I track:

# Resource usage
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- nginx_ingress_controller_requests_total

# Application metrics
- socket_io_connected_clients
- terminal_sessions_active
- s3_operations_total
- pod_startup_duration_seconds
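
The application-level metrics come from instrumenting the Runner directly. A sketch using prom-client (my assumption for the metrics library; the metric names match the list above):

import client from "prom-client";

// Sketch: expose two of the custom metrics listed above.
const connectedClients = new client.Gauge({
  name: "socket_io_connected_clients",
  help: "Currently connected Socket.IO clients"
});

const s3Operations = new client.Counter({
  name: "s3_operations_total",
  help: "Total S3 operations issued by the Runner",
  labelNames: ["operation"]
});

io.on("connection", (socket) => {
  connectedClients.inc();
  socket.on("disconnect", () => connectedClients.dec());
});

// ...then s3Operations.inc({ operation: "putObject" }) after each S3 call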

Error Handling

Every external call needs robust error handling:

socket.on("updateContent", async ({ path, content }) => {
  try {
    await fs.writeFile(`/workspace/${path}`, content);
    
    try {
      await s3.putObject({...}).promise();
    } catch (s3Error) {
      // S3 failure shouldn't break UX
      logger.error('s3_sync_failed', { path, error: s3Error.message });
      
      // Queue for retry
      retryQueue.add({ path, content, projectId });
    }
  } catch (fsError) {
    socket.emit('error', { message: 'Failed to save file' });
    logger.error('file_save_failed', { path, error: fsError.message });
  }
});
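
The retryQueue referenced here (and flushed during shutdown below) is doing real work. A minimal in-memory version could look like the following; it loses queued writes if the pod dies mid-retry, so a durable queue would be the production answer:

// Sketch: in-memory retry queue for failed S3 syncs.
class RetryQueue {
  constructor() {
    this.items = [];
    setInterval(() => this.flush(), 10_000); // retry every 10s
  }

  add(item) {
    this.items.push(item);
  }

  async flush() {
    const pending = this.items.splice(0); // take everything
    for (const { projectId, path, content } of pending) {
      try {
        await s3.putObject({
          Bucket: "my-bucket",
          Key: `projects/${projectId}/${path}`,
          Body: content
        }).promise();
      } catch {
        this.items.push({ projectId, path, content }); // try again later
      }
    }
  }
}

const retryQueue = new RetryQueue();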

Graceful Shutdown

When a pod is terminated, clean up gracefully:

process.on('SIGTERM', async () => {
  logger.info('shutdown_initiated', {});
  
  // Stop accepting new connections
  io.close();
  
  // Give existing operations time to complete
  await new Promise(resolve => setTimeout(resolve, 5000));
  
  // Close all terminal sessions
  terminalService.closeAll();
  
  // Flush any pending S3 writes
  await retryQueue.flush();
  
  process.exit(0);
});

Security Hardening

Security is non-negotiable for a platform that runs arbitrary user code.

Container Isolation:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true  # Except /workspace

Network Policies (prevent pods from communicating with each other):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8       # Block internal network
              - 172.16.0.0/12
              - 192.168.0.0/16

Resource Limits (prevent resource exhaustion):

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

Lessons Learned

1. Init Containers Are Underrated

The init container pattern solved my biggest challenge: pre-populating the filesystem before the app starts, using nothing beyond what Kubernetes ships out of the box.

2. WebSockets Need Careful Error Handling

Connections drop. Networks fail. I learned to implement:

  • Automatic reconnection with exponential backoff (see the sketch after this list)
  • Message queuing during disconnects
  • Heartbeat pings to detect dead connections
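
Socket.IO's client covers the first and third points out of the box; the message queue is on you. A sketch:

// Sketch: built-in socket.io-client reconnection options plus a
// simple outbox that replays messages queued while disconnected.
const socket = io("https://abc123.justrunit.work.gd", {
  reconnection: true,
  reconnectionAttempts: Infinity,
  reconnectionDelay: 1000,      // first retry after ~1s
  reconnectionDelayMax: 30000,  // back off up to 30s
  randomizationFactor: 0.5      // jitter to avoid thundering herds
});

const outbox = [];

function send(event, payload) {
  if (socket.connected) socket.emit(event, payload);
  else outbox.push([event, payload]); // queue while offline
}

socket.io.on("reconnect", () => {
  for (const [event, payload] of outbox.splice(0)) socket.emit(event, payload);
});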

3. PTY Is Not Just "Running Commands"

A real terminal needs:

  • Proper signal handling (Ctrl+C, Ctrl+Z)
  • Window resize events
  • ANSI escape code support
  • Session persistence

4. Multi-Tenancy Is Hard

Isolating users requires thinking about:

  • Resource limits (CPU, memory, disk)
  • Network policies (prevent cross-pod communication)
  • Filesystem isolation (each pod has its own /workspace)
  • Process isolation (containerization handles this)

5. Persistence Strategy Matters

I chose S3 because:

  • Pods are ephemeral—they can be killed anytime
  • S3 provides durability (11 9's)
  • Init containers make S3 → Pod sync seamless
  • Real-time sync keeps S3 updated

What I'd Do Differently

If I were starting over:

Use a Message Queue — Decouple the Orchestrator from synchronous K8s API calls. RabbitMQ or Redis Streams would make the system more resilient.

Implement Pod Pooling — Pre-create a pool of warm pods to reduce startup latency from 30 seconds to <2 seconds.
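
Pod pooling in sketch form: keep a handful of generic warm pods running, and on /start claim one by relabeling it so the project's Service selects it, rather than creating a Deployment from scratch. Everything below is hypothetical; the actual Orchestrator above always cold-starts:

// Hypothetical warm-pod pool. claim() relabels an idle pod so the
// project's Service (selector: app=<projectId>) picks it up.
class PodPool {
  constructor(k8sCoreApi, targetSize = 10) {
    this.k8sCoreApi = k8sCoreApi;
    this.targetSize = targetSize;
    this.idle = []; // names of warm, unclaimed pods
  }

  async claim(projectId) {
    const podName = this.idle.shift();
    if (!podName) return null; // pool empty: fall back to cold start

    // Strategic-merge patch; older client versions need the
    // content-type header passed explicitly.
    await this.k8sCoreApi.patchNamespacedPod(podName, "default", {
      metadata: { labels: { app: projectId } }
    });

    this.replenish(); // fire-and-forget: spawn a replacement warm pod
    return podName;
  }

  async replenish() {
    // create warm pods until this.idle.length reaches this.targetSize
  }
}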

Cost Analysis

Let's talk money. Running a cloud IDE isn't cheap.

Per-Project Costs (AWS, us-east-1)

| Resource | Specification | Monthly Cost |
|----------|---------------|--------------|
| EC2 (pod) | 1 vCPU, 1GB RAM | ~$7.50 |
| S3 Storage | 100MB project | ~$0.0023 |
| Data Transfer | ~1GB/month | ~$0.09 |

Total per active project: ~$7.50/month

Platform Costs (Fixed)

| Resource | Specification | Monthly Cost |
|----------|---------------|--------------|
| EKS Control Plane | Managed Kubernetes | $72 |
| Load Balancer | Network LB | $16 |
| NAT Gateway | Outbound traffic | $32 |
| Init/Orchestrator nodes | 2× t3.medium | $60 |

Fixed monthly cost: ~$180

Break-Even Analysis

Fixed costs: $180/month
Per-project cost: $7.50/month

At $10/user/month pricing:
Break-even = 180 / (10 - 7.50) = 72 users

At $15/user/month pricing:
Break-even = 180 / (15 - 7.50) = 24 users

Conclusion

Building Just Run It has been an incredible learning journey. What started as curiosity about "how does Replit work?" turned into a deep dive through:

  • Kubernetes orchestration and dynamic resource management
  • Real-time systems with WebSockets and event-driven architecture
  • Process management with pseudo-terminals
  • Distributed storage patterns with S3
  • Multi-tenant security and isolation

Tech Stack Summary

Frontend:

  • React
  • Monaco Editor (VS Code editor)
  • xterm.js (terminal emulation)
  • Socket.IO Client

Backend:

  • Node.js, Express, TypeScript
  • Socket.IO (real-time communication)
  • node-pty (pseudo-terminal)

Infrastructure:

  • Kubernetes (container orchestration)
  • NGINX Ingress Controller
  • Docker
  • AWS S3 (persistent storage)
  • @kubernetes/client-node (K8s API)

Written by Harsh Mange

Software Engineer passionate about building scalable backend systems and sharing knowledge through writing.
