TL;DR
This post covers the architecture behind Dataism, a production AI content generation platform I built that processes 100+ characters per hour using ComfyUI as the rendering engine. The system chains image generation, LoRA fine-tuning, voice cloning, and video synthesis into automated pipelines — all orchestrated through a FastAPI backend with WebSocket-driven real-time monitoring.
The Problem: Content at Scale Without Losing Identity
Most AI image generation tutorials end at "here's how to generate a single image." That is the easy part. The hard part is generating hundreds of consistent characters — each with a persistent identity, unique voice, and video presence — without manual intervention.
When I started building Dataism, the requirements were clear:
- Generate batches of characters across diverse types (K-Pop idols, fitness athletes, influencers, rappers, testimonial personas)
- Each character needs identity persistence — the same face across different poses, outfits, and contexts
- Characters need voices (cloned or designed) and video content (lip-synced, animated)
- Everything must run autonomously with minimal human oversight
- The system should sustain 48+ hours of continuous operation without failure
ComfyUI turned out to be the right engine for this — not because of its UI, but because of its node-based workflow system that can be driven entirely through its API.
Architecture Overview
The system has four layers:
```
┌───────────────────────────┐
│ Next.js Dashboard         │  Real-time monitoring, batch controls
│ (Redux + WebSocket)       │
├───────────────────────────┤
│ FastAPI Backend           │  REST API, job orchestration, scheduler
│ (Async Workers)           │
├───────────────────────────┤
│ Pipeline Services         │  Z-Image, FLUX2, SDXL pipelines
│ (LoRA Manager, Prompts)   │
├───────────────────────────┤
│ ComfyUI Engine            │  30+ workflows, GPU execution
│ (Flux, SDXL, VibeVoice)   │
└───────────────────────────┘
```
The backend never touches pixel data directly. It constructs workflow JSON, injects parameters, sends it to ComfyUI's API, and monitors execution through WebSocket events. ComfyUI handles all GPU-bound work.
Driving ComfyUI Programmatically
ComfyUI's real power is not its drag-and-drop UI — it is the fact that every workflow is a JSON graph of nodes that can be manipulated programmatically. Each node has an ID, a class type, and inputs that reference other nodes by ID.
Here is what a simplified character generation workflow looks like when you strip away the UI:
```json
{
  "1": {
    "class_type": "CLIPLoader",
    "inputs": {
      "clip_name": "mistral_3_small_flux2_bf16.safetensors"
    }
  },
  "2": {
    "class_type": "UNETLoader",
    "inputs": {
      "unet_name": "Flux2_dev_fp8mixed.safetensors"
    }
  },
  "5": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "Professional portrait of a 23-year-old woman...",
      "clip": ["1", 0]
    }
  },
  "8": {
    "class_type": "SamplerCustomAdvanced",
    "inputs": {
      "noise": ["6", 0],
      "guider": ["7", 0],
      "sampler": ["9", 0],
      "sigmas": ["10", 0],
      "latent_image": ["11", 0]
    }
  }
}
```
The key insight: you can load a workflow template, swap out the prompt text, change the seed, adjust guidance values, inject a LoRA model path, and queue it — all without ever opening the ComfyUI interface.
Our backend does exactly this:
```python
import json
import random

import httpx

async def run_workflow(workflow_path: str, params: dict) -> str:
    """Load a workflow template, inject parameters, queue it."""
    with open(workflow_path) as f:
        workflow = json.load(f)

    # Inject prompt into the CLIPTextEncode node
    workflow["5"]["inputs"]["text"] = params["prompt"]

    # Set the seed for reproducibility (randint is inclusive on both ends)
    workflow["6"]["inputs"]["noise_seed"] = params.get("seed", random.randint(0, 2**32 - 1))

    # If using a LoRA, inject the model name
    if params.get("lora_name"):
        workflow["12"]["inputs"]["lora_name"] = params["lora_name"]

    # Queue on ComfyUI
    async with httpx.AsyncClient() as client:
        response = await client.post(f"{COMFYUI_HOST}/prompt", json={"prompt": workflow})
    return response.json()["prompt_id"]
```
The Three-Pipeline Architecture
Not all characters are created equal. Realistic characters (K-Pop idols, fitness athletes) need different rendering than stylized ones (mascots, cartoons). We run three independent pipelines:
Z-Image Pipeline (Realistic Characters)
This is the primary pipeline for photorealistic characters. It runs a four-stage process:
- Base Generation — Flux Schnell generates a fast draft image from the character prompt
- Variation Refinement — Z-Image Turbo refines the draft into 20 high-quality variations with diverse poses, expressions, and contexts
- LoRA Training — The 20 variations become training data for a character-specific LoRA, giving identity persistence
- LoRA Testing — Generate test images with the trained LoRA to validate identity consistency
The entire four-stage pipeline runs autonomously. Drop in a character type and name, and 30 minutes later you have a trained LoRA that can reproduce that character's face in any context.
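The four stages above can be sketched as an ordered runner that stops at the first failure; the stage names follow the post, while the runner and handler signature are hypothetical:

```python
from typing import Callable

# Stage order of the Z-Image pipeline, as described above
ZIMAGE_STAGES: list[str] = [
    "base_generation",       # Flux Schnell draft image
    "variation_refinement",  # Z-Image Turbo, 20 variations
    "lora_training",         # variations become LoRA training data
    "lora_testing",          # validate identity consistency
]

def run_pipeline(character: str,
                 handlers: dict[str, Callable[[str], bool]]) -> list[str]:
    """Run each stage in order; stop at the first failing stage.

    Returns the list of stages that completed successfully.
    """
    completed: list[str] = []
    for stage in ZIMAGE_STAGES:
        if not handlers[stage](character):
            break
        completed.append(stage)
    return completed
```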
FLUX2 Pipeline (Alternative Realistic)
Uses Flux2 Dev for two-stage text-to-image followed by image-to-image refinement. Same quality target as Z-Image but with different aesthetic characteristics.
SDXL Pipeline (Stylized Characters)
For mascots, cartoons, and illustrated characters where photorealism is not the goal. SDXL's strength in stylized outputs makes it the right choice here.
Each pipeline has its own service class with per-job context tracking, so multiple characters can be processed in parallel without state conflicts.
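A minimal version of that per-job context tracking: each job id owns its own mutable state, so concurrent characters never write to shared fields. The `JobContext` fields here are illustrative, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class JobContext:
    """Illustrative per-job state; real fields depend on the pipeline."""
    character_name: str
    stage: str = "pending"
    artifacts: list[str] = field(default_factory=list)

class JobRegistry:
    """Maps job ids to isolated contexts for parallel pipelines."""

    def __init__(self) -> None:
        self._jobs: dict[str, JobContext] = {}

    def start(self, job_id: str, character_name: str) -> JobContext:
        ctx = JobContext(character_name)
        self._jobs[job_id] = ctx
        return ctx

    def get(self, job_id: str) -> JobContext:
        return self._jobs[job_id]
```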
LoRA Training and Lifecycle Management
LoRA (Low-Rank Adaptation) is what gives each character a persistent identity. Without it, generating the same character twice would produce two different people. With a trained LoRA, you can generate "Yuna in a coffee shop" and "Yuna on stage" and get the same recognizable person.
The challenge is managing LoRA lifecycle at scale. When you are generating dozens of characters, you need to:
- Train LoRAs automatically from generated variations
- Store them in a permanent location (not ComfyUI's working directory)
- Mount them on-demand when generating new content for that character
- Unmount them when done to avoid polluting ComfyUI's model list
- Handle cross-platform differences (symlinks on Linux/macOS, file copies on Windows)
Our LoRA Manager handles all of this with a context manager pattern:
```python
@dataclass
class LoRARecord:
    character_name: str
    character_type: str
    version: int
    filename: str
    permanent_path: str
    is_mounted: bool
    size_bytes: int
```

```python
async with lora_manager.mounted("Yuna", lora_path) as filename:
    workflow["lora_node"]["inputs"]["lora_name"] = filename
    await run_workflow(workflow)
# LoRA is automatically unmounted after generation
```
The manager also runs cleanup on startup, removing stale symlinks from previous sessions that may have crashed.
Prompt Engineering at Scale
Generating 100+ characters per hour means you cannot write prompts by hand. Our DualModelPromptBuilder generates context-aware prompts based on character metadata:
```python
class DualModelPromptBuilder:
    def build_base_prompt(self, character_type, gender, age, name):
        """Generate the initial character prompt."""
        # Base: realistic, high-contrast photography
        # Variations: different poses, emotions, lighting
        # Training captions: descriptive labels for LoRA fine-tuning
        # Test prompts: validation of identity consistency
```
The builder generates four types of prompts from a single character definition:
- Base prompts — High-contrast photographic style for initial generation
- Variation prompts — Diverse poses, expressions, outfits, and contexts
- Training captions — Descriptive labels paired with each variation image for LoRA fine-tuning
- Test prompts — Novel scenarios to validate the trained LoRA maintains identity
Every prompt enforces "full body visible" to prevent cropped training data, and uses gender-appropriate language throughout.
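In sketch form, this kind of template-driven assembly looks as follows; the pools and template are invented stand-ins, not the production prompts:

```python
import random

# Hypothetical variation pools; the real builder draws from much larger ones
POSES = ["standing on stage", "walking outdoors", "seated in a cafe"]
LIGHTING = ["soft studio lighting", "golden hour sun", "neon night light"]

def build_variation_prompt(name: str, age: int, gender: str,
                           character_type: str, rng: random.Random) -> str:
    """Compose one variation prompt from character metadata.

    Every prompt ends with the 'full body visible' constraint
    to keep training data uncropped.
    """
    subject = f"{age}-year-old {gender} {character_type.replace('_', ' ')} named {name}"
    return (f"High-contrast photo of a {subject}, "
            f"{rng.choice(POSES)}, {rng.choice(LIGHTING)}, full body visible")
```

Passing in a seeded `random.Random` keeps prompt generation reproducible per character, which matters when you need to regenerate a failed batch.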
Voice Cloning and Audio
Characters need voices. The system integrates two approaches:
VibeVoice (Speaker-Adaptive Cloning)
VibeVoice takes a reference audio clip and generates new speech in that voice. We use the 1.5B parameter model (5.4 GB) for production, with larger models available for higher fidelity:
```json
{
  "class_type": "VibeVoiceSingleSpeakerNode",
  "inputs": {
    "audio": ["audio_loader", 0],
    "text": "Hello, I'm your new AI assistant.",
    "model_name": "VibeVoice-1.5B"
  }
}
```
Qwen TTS (Voice Design)
For characters that need a designed voice rather than a cloned one, Qwen2 TTS generates speech from text with configurable voice parameters.
SoulX Singer (Voice Conversion)
For musical content, SoulX converts existing songs into a character's voice — enabling AI-generated music videos with consistent character voices.
The audio pipeline chains these together: generate or clone a voice, create a song in that voice, and feed both into the video generation pipeline.
Video Generation
Static images are not enough. The system generates video content through three engines:
WAN Animate 2.2
Frame-based animation from a single reference image. Takes a character image and an animation prompt (walking, dancing, talking) and generates a short video clip. The Painter variant enables long-form video generation.
InfiniteTalk (Lip-Sync)
The most impressive pipeline. InfiniteTalk takes a character image and audio (either pre-recorded or cloned via VibeVoice) and generates a talking-head video with accurate lip synchronization.
Character Image + Cloned Audio → InfiniteTalk → Lip-Synced Video
This is the pipeline that enables AI-generated content creators — a fully synthetic person speaking in a consistent voice.
LTX (Performance Video)
For performance-oriented content (dancing, stage presence), LTX generates higher-motion video sequences.
Video Post-Processing
Individual clips are combined, upscaled, and composited using utility workflows. The video-combine workflow stitches multiple clips into a single output, and the utility_video_upscale workflow enhances resolution.
Batch Processing and CSV Import
For production runs, individual character creation is too slow. The system supports two batch modes:
Auto-Generation
Specify a character type and count, and the system auto-generates names, ages, and genders:
```
POST /api/character-creation/batch
{
  "character_type": "kpop_idol",
  "count": 10,
  "group_name": "STELLAR"   # Optional: creates as K-Pop group
}
```
For K-Pop groups, the system automatically assigns roles (Leader, Main Vocalist, Main Rapper, Main Dancer, Visual, Maknae) and manages gender composition patterns.
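Role assignment reduces to cycling over the role list; the role order follows the post, and the repeat rule for oversized groups is an assumption:

```python
from itertools import cycle, islice

KPOP_ROLES = ["Leader", "Main Vocalist", "Main Rapper",
              "Main Dancer", "Visual", "Maknae"]

def assign_roles(member_names: list[str]) -> dict[str, str]:
    """Give each member a role in order.

    Roles repeat if the group is larger than the role list
    (e.g. a 7th member cycles back to "Leader").
    """
    roles = islice(cycle(KPOP_ROLES), len(member_names))
    return dict(zip(member_names, roles))
```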
CSV Batch
Upload a CSV with character specifications, and the system fills in any missing fields:
```csv
character_type,character_name,age,gender,custom_prompt
kpop_idol,Yuna,23,female,wearing pink stage outfit
rapper,,,,
fitness_athlete_female,Ashley,28,female,
```
Missing names are auto-generated from contextual name pools (Korean names for K-Pop, stage names for rappers). Missing genders are inferred from character type. Missing ages fall within type-appropriate ranges.
Name uniqueness is guaranteed across the entire system — checking existing folders on disk, current batch session memory, and applying numeric suffixes when collisions occur.
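That uniqueness check reduces to one function over the set of all known names (disk folders plus session memory); the exact suffix format here is an assumption:

```python
def unique_name(candidate: str, taken: set[str]) -> str:
    """Return candidate, or candidate with the lowest free numeric suffix.

    The `taken` set is updated in place so later calls in the same
    batch session see names claimed earlier.
    """
    if candidate not in taken:
        taken.add(candidate)
        return candidate
    n = 2
    while f"{candidate}{n}" in taken:
        n += 1
    final = f"{candidate}{n}"
    taken.add(final)
    return final
```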
Deduplication with Perceptual Hashing
When generating hundreds of images, duplicates happen. Same seed plus similar prompt equals near-identical output. We catch these with perceptual hashing (pHash):
```python
class ImageQualityChecker:
    """Tracks perceptual hashes of everything generated so far."""

    def __init__(self, threshold: int = 5):
        self.hash_store: set = set()
        self.threshold = threshold  # max Hamming distance to count as duplicate

    def check_duplicate(self, image_path: str) -> bool:
        """Compare pHash against all previously generated images."""
        new_hash = compute_phash(image_path)  # e.g. via the imagehash library
        for existing_hash in self.hash_store:
            if hamming_distance(new_hash, existing_hash) < self.threshold:
                return True  # Duplicate detected
        self.hash_store.add(new_hash)
        return False
```
Detected duplicates trigger automatic regeneration with a new seed. Combined with seed tracking, this ensures every output is visually unique.
Real-Time Monitoring
The Next.js dashboard connects via WebSocket for real-time updates:
- Multi-channel support — Character pipeline, CSV batch, and group creation each have independent channels
- Live progress — Current count, total, percentage, estimated time remaining
- Log streaming — Every pipeline event (generation started, LoRA training complete, duplicate detected) streams to the dashboard in real time
- Channel-aware stop control — Stopping a CSV batch does not interrupt a running character pipeline
The WebSocket middleware in Redux manages connection state and dispatches events to the appropriate slice:
```javascript
// WebSocket middleware handles multi-channel routing
switch (message.channel) {
  case "character_pipeline":
    dispatch(updateCharacterProgress(data));
    break;
  case "csv_batch":
    dispatch(updateBatchProgress(data));
    break;
  case "group_creation":
    dispatch(updateGroupProgress(data));
    break;
}
```
Channel-Aware Stop Control
This was one of the trickier engineering problems. When a user clicks "Stop" on a CSV batch, you need to:
- Find all ComfyUI prompts that belong to the `csv_batch` channel
- Mark them as cancelled in our tracking system
- Check if the currently running ComfyUI prompt belongs to this channel
  - If yes — send an interrupt to ComfyUI
  - If no (it belongs to `character_pipeline`) — do NOT interrupt, just cancel pending jobs
- Remove pending jobs for this channel from ComfyUI's queue
Without prompt ownership tracking, a naive "cancel everything" approach would kill unrelated jobs running on different channels.
Scheduler and Automation
For continuous content production, the system includes a daily scheduler:
```
POST /api/automation/enable?time=09:00&timezone=UTC
```
The scheduler triggers batch generation at the configured time, runs through the full pipeline (generation → training → testing), and logs results. It has been tested for 48+ hours of uninterrupted operation.
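The scheduling math itself is small: find the next occurrence of HH:MM and sleep until then. A sketch in naive local time; the production scheduler also handles the timezone parameter:

```python
from datetime import datetime, timedelta

def seconds_until(run_time: str, now: datetime) -> float:
    """Seconds from `now` until the next daily occurrence of HH:MM."""
    hour, minute = map(int, run_time.split(":"))
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot already passed
    return (target - now).total_seconds()
```

An async worker can then loop forever: `await asyncio.sleep(seconds_until(...))`, run the batch, repeat.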
Lessons Learned
ComfyUI is an Engine, Not a UI
The drag-and-drop interface is for prototyping workflows. In production, ComfyUI is a GPU execution engine that you drive through its API. Design your workflows in the UI, export them as JSON, and never open the UI again.
LoRA Training is the Identity Layer
Without LoRA, you are generating random people. With LoRA, you are generating a specific person in novel contexts. The training → mount → use → unmount lifecycle needs to be airtight, especially at scale.
Channel Isolation is Non-Negotiable
The moment you have multiple concurrent pipelines sharing a single ComfyUI instance, you need prompt ownership tracking. Without it, stop commands become weapons of mass destruction.
Perceptual Hashing Saves Storage and Credibility
At 100+ characters per hour, you will generate duplicates. Catching them before they hit storage saves disk space. Catching them before they reach a client saves credibility.
WebSocket is the Right Choice for Progress
Polling an API every second for batch progress is wasteful and laggy. WebSocket gives you real-time updates with minimal overhead. For content generation where individual jobs take 10-60 seconds, the real-time feedback matters.
What is Next
The pipeline is evolving toward full autonomy — generating characters, training their LoRAs, cloning their voices, producing video content, and publishing — all from a single CSV upload. The pieces are in place. The integration work continues.
If you are building AI content pipelines and want to discuss architecture, feel free to book a call.