# Main technical risks

## 1. Real-ESRGAN first, on every image, is your biggest quality risk

Running every image through 4× ESRGAN and then downscaling back to the original size can definitely improve some photos, but it can also introduce:

* hallucinated texture
* crispy foliage
* waxy skin after interaction with later steps
* fake edge detail
* zippering around fine geometry
* over-defined JPEG blocks on already compressed Fuji JPEGs

This is the part I would treat as conditionally applied, not universal.

My recommendation: gate ESRGAN based on image characteristics, or at least use different strength paths for portrait vs landscape vs night. Examples:

* portraits: maybe skip global ESRGAN, or use a weaker path
* night/high-ISO: be careful, because ESRGAN can turn noise into invented detail
* landscapes/architecture: often benefit the most

Right now the pipeline assumes "seen at 16K then downscaled" is always a win. It often is not.

## 2. CodeFormer after global enhancement can amplify inconsistency

CodeFormer is useful, but it can produce faces that look slightly detached from the rest of the frame if the global pipeline has already altered texture and local contrast.

Potential issues:

* face crops look cleaner than surrounding skin/neck/hair
* restored face sharpness conflicts with depth blur/sharpen later
* multiple faces in one frame may get uneven treatment

Things to consider:

* apply CodeFormer only when face size exceeds a threshold
* use a lower-strength/fidelity profile depending on scene
* skip CodeFormer for distant faces
* log face count and face bounding-box size into metadata

That would make the workflow easier to debug when faces look "too AI."

## 3. Scene classification using 8 CLIP prompts is clever but brittle

This is a nice lightweight idea, but it is likely the weakest decision point in the pipeline, because eight prompts force coarse categorization.
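To make the brittleness concrete: reducing the eight prompt scores to a single argmax label discards how close the runners-up were. A minimal sketch, assuming the per-prompt similarity logits from CLIP are already computed (the label names and numbers below are illustrative, not taken from your pipeline):

```python
import numpy as np

def classify_scene(logits: dict[str, float], temperature: float = 100.0):
    """Softmax over CLIP similarity logits; returns (argmax label, full distribution)."""
    labels = list(logits)
    scores = np.array([logits[l] for l in labels]) * temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    dist = dict(zip(labels, probs.round(3)))
    return max(dist, key=dist.get), dist

# A beach sunset: "beach" wins the argmax, but "golden_hour" is nearly tied,
# so a hard switch between their grade profiles is essentially a coin flip.
label, dist = classify_scene({"beach": 0.231, "golden_hour": 0.228, "landscape": 0.19})
```

Keeping `dist` around (rather than only `label`) is what makes the blending and logging ideas below possible at zero extra inference cost.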
Possible failure cases:

* a beach sunset might oscillate between beach, golden_hour, and landscape
* indoor portraits near a window may flip between portrait and indoor
* urban night scenes may misclassify between street and night
* a cloudy mountain lake might land on overcast vs landscape

Because your grade profile changes exposure/contrast/saturation/detail/denoise, a wrong label can materially alter the image.

Better approach:

* store the full prompt score distribution, not just the argmax
* use the top-2 or top-3 labels
* blend profiles based on confidence instead of hard-switching

For example: 60% landscape + 40% golden_hour instead of forcing one profile. That would reduce sudden profile mistakes.

## 4. CPU image ops at 4K are fine, but not yet optimized as a pipeline

Your CPU-bound stages are sensible, but there are some efficiency concerns:

* guidedFilter and morphology/blur passes at full 4K are not trivial
* ImageScaleBy 16K → 4K on CPU may be heavier than it looks
* repeated color-space conversions and full-frame copies can become memory-bandwidth bound
* if you later parallelize multiple photos, the CPU becomes the bottleneck before GPU memory does

This matters because your throughput is already 40–50 s/photo, and if you batch more aggressively you may saturate the host CPU. I would especially watch:

* OpenCV allocations
* Python ↔ tensor conversion overhead inside custom nodes
* whether large intermediate tensors are duplicated unnecessarily

## 5. Polling /history/ every 2 s is workable but not ideal

It is acceptable, but it is a weak point operationally. Risks:

* stale/incomplete history states
* long-run prompt ambiguity if ComfyUI restarts
* polling delay adds latency
* harder recovery when output partially exists but metadata doesn't

If ComfyUI or your wrapper supports websocket progress or event-driven status, that would be better.
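ComfyUI does expose a websocket (`/ws?clientId=...`) that streams execution events, as used in its standard websocket API example. A minimal event-driven wait might look like the sketch below (Python for illustration; your Ruby control plane would do the same with a websocket gem, and the message shapes should be verified against your ComfyUI version):

```python
import json

def prompt_finished(msg: dict, prompt_id: str) -> bool:
    """True when ComfyUI reports the whole prompt finished (an 'executing' event with node null)."""
    data = msg.get("data", {})
    return (msg.get("type") == "executing"
            and data.get("node") is None
            and data.get("prompt_id") == prompt_id)

def wait_for_prompt(host: str, client_id: str, prompt_id: str) -> None:
    """Block until the prompt completes, instead of polling /history/ every 2 s."""
    import websocket  # pip install websocket-client; imported lazily in this sketch
    ws = websocket.create_connection(f"ws://{host}/ws?clientId={client_id}")
    try:
        while True:
            frame = ws.recv()
            if isinstance(frame, str) and prompt_finished(json.loads(frame), prompt_id):
                return
    finally:
        ws.close()
```

Even with this, you would still want a timeout and the file-level validation described next, because a dropped websocket looks the same as a hung prompt.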
If not, I would at least strengthen state validation:

* ensure expected output files exist and are complete
* ensure the metadata JSON corresponds to the same prefix
* distinguish timeout from partial success

# Biggest architectural improvement opportunities

## 1. Add conditional routing, not one fixed pipeline for every photo

Right now the graph is elegant, but it is still mostly single-path. A more robust system would route based on detected attributes:

* no faces → skip CodeFormer
* little/no sky → skip SkyEnhance
* low-confidence scene label → use a default conservative grade
* low-detail or noisy photo → reduce or skip ESRGAN
* already high-contrast/high-saturation image → apply a weaker grade

That would reduce over-processing and save time.

## 2. Move from hardcoded profiles to measured image statistics

Your scene profiles are sensible, but they are still hand-tuned guesses. A stronger next step would be to incorporate measured stats such as:

* luminance histogram
* highlight clipping ratio
* shadow floor occupancy
* saturation percentile
* edge density
* noise estimate
* face area percentage
* sky coverage

Then use those stats to modulate:

* exposure
* saturation
* detail multiplier
* denoise
* background blur

That would make the pipeline more adaptive and less prompt-dependent.

## 3. Preserve and restore metadata more deliberately

You correctly bake orientation before upload because ComfyUI strips EXIF. Good. But converting the final PNG to JPEG without explicit metadata handling means you may be losing:

* original EXIF fields
* capture time
* lens/camera info
* ICC profile
* GPS, if present
* copyright/author data

That may be fine, but if the intent is "enhanced derivative of original photo," I would consider:

* copying selected EXIF fields from source to final JPEG
* preserving or explicitly assigning an ICC profile
* adding a software tag / processing note
* optionally stripping privacy-sensitive fields by choice, not by accident

Color profile handling is especially important. "No colour corrections" is not the same as "color managed."
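The selective-copy idea can be sketched with Pillow in a few lines (Python for illustration, e.g. as a small post-processing step; the `save_with_provenance` name and the choice of which EXIF tags to keep are illustrative assumptions, not your pipeline's API):

```python
from PIL import Image

# Tag IDs from the EXIF spec. Everything else, including GPS,
# is dropped deliberately rather than by accident.
KEEP_TAGS = {0x010F: "Make", 0x0110: "Model", 0x0132: "DateTime", 0x8298: "Copyright"}

def save_with_provenance(src_path: str, enhanced_png: str, out_jpeg: str, quality: int = 92):
    """Write the enhanced JPEG with selected source EXIF, a Software tag, and the source ICC."""
    src = Image.open(src_path)
    src_exif = src.getexif()

    exif = Image.Exif()
    for tag in KEEP_TAGS:
        if tag in src_exif:
            exif[tag] = src_exif[tag]
    exif[0x0131] = "photo-enhance pipeline"  # Software tag: mark the derivative

    out = Image.open(enhanced_png).convert("RGB")
    out.save(out_jpeg, "JPEG", quality=quality, exif=exif,
             # Carry over the source's embedded profile, if any
             icc_profile=src.info.get("icc_profile"))
```

If the source has no embedded ICC profile, you would want to explicitly assign sRGB rather than leave the output untagged.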
## 4. Add resumability per stage, not just per photo

Your manifest marks a photo done after full completion, which is good, but partial reruns still require redoing all remote processing for failed photos. You could get stronger resilience with stage-aware artifacts:

* oriented temp exists
* upload completed
* prompt submitted
* output downloaded
* JPEG written
* metadata written

That might be too much overhead for a personal workflow, but even just logging prompt_id per source photo would help a lot with crash recovery.

## 5. Treat JPEG as an output format decision, not a fixed end state

JPEG quality 92 is reasonable, but for some content (foliage, gradients in skies, deep edits after enhancement) JPEG may reintroduce artifacts after all that expensive work. Consider:

* archival output as PNG or TIFF
* delivery output as JPEG
* optional WebP/AVIF for web usage

Even if you keep JPEG as primary, having a "master enhanced output" option would be useful.

# Specific comments on the custom stages

## AdaptivePhotoGrade

This is the most promising custom logic in the workflow.

Good:

* exposure in linear light
* contrast and saturation as explicit steps
* detail/base decomposition
* per-scene profiles

Concerns:

* the gamma 2.2 approximation is simple, but the true sRGB transfer is not exactly 2.2
* clipping highlights at 1.0 can lose recoverable rolloff smoothness
* HSV saturation edits can behave poorly in skin tones and near highlights
* fixed midpoint contrast around 0.5 is simple but not content-aware

If you keep evolving it, the next quality wins will likely come from:

* proper sRGB transfer functions
* luminance-aware saturation
* highlight/shadow selective controls
* local contrast constrained by a noise estimate

## SkyEnhance

Clever and cheap. Good for a CPU stage.
Risks:

* blue clothing, windows, water, reflective buildings, and tinted glass can get caught
* sunset banding or haloing near trees/buildings
* the vertical prior helps, but can still fail on mountains or upside-weighted compositions

I would recommend logging:

* sky coverage %
* mean mask confidence
* whether sky enhancement was effectively skipped

And maybe auto-disable it when coverage is too low or too fragmented.

## DepthSelectiveSharpen

This is an interesting stage, but also easy to overdo.

Pros:

* more photographic than simple global sharpening
* can add subject separation

Risks:

* relative depth is not segmentation
* hair, glasses, transparent objects, fences, and fine branches can create messy transitions
* background blur on an already naturally focused image may look synthetic
* blur-plus-sharpen in one stage can produce "smartphone portrait mode" artifacts

I would strongly consider making this more conservative:

* lower the default blur
* maybe sharpen the foreground only, without explicit background blur
* or gate blur by scene type and depth confidence

For many photos, foreground sharpening alone may be enough.

# Performance review

Your breakdown is believable. The biggest performance cost drivers are probably:

* ESRGAN 4× inference
* memory movement around the 16K intermediate
* the downscale from 16K to 4K
* Depth Anything inference

This means the obvious speed/quality tradeoff lever is reducing or conditionally skipping the 4× path. That one decision could cut runtime materially.

If you want better throughput later, likely gains are:

* a batch submission queue with bounded concurrency
* reusing loaded models across jobs, which ComfyUI already helps with
* avoiding oversized intermediates when not needed
* possibly moving some CPU image ops to GPU if they become limiting

But honestly, for 45 photos, the current runtime is already acceptable.
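A cheap gate for that lever can run before submission, from statistics on the decoded image. A sketch, with thresholds that are illustrative assumptions to tune rather than measured values:

```python
import numpy as np

def esrgan_mode(gray: np.ndarray, face_area_pct: float) -> str:
    """Decide 'full' / 'weak' / 'skip' for the 4x pass from cheap luma statistics.

    gray: float32 luma in [0, 1]. Threshold values are illustrative starting points.
    """
    # Edge density: fraction of pixels with a strong local gradient
    gy, gx = np.gradient(gray)
    edges = float((np.hypot(gx, gy) > 0.05).mean())

    # Crude noise estimate: median absolute second difference (flat areas dominate)
    noise = float(np.median(np.abs(np.diff(gray, n=2, axis=0))))

    if noise > 0.02:            # high-ISO look: ESRGAN would invent detail from noise
        return "skip"
    if face_area_pct > 10.0:    # portrait-dominant frame: take the weaker path
        return "weak"
    if edges > 0.05:            # detail-rich landscape/architecture: benefits most
        return "full"
    return "weak"
```

The same two statistics (edge density, noise estimate) are also exactly what the measured-statistics grading idea above needs, so the analysis cost is shared.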
# Operational review

This is better than average for reliability, but I would still tighten a few things.

## Add stronger failure modes

Include distinct handling for:

* upload success but prompt submission failure
* prompt accepted but no output file
* PNG written but metadata JSON missing
* metadata JSON malformed
* partial download
* ComfyUI restart mid-run

## Record more run provenance

In the metadata or manifest, store:

* prompt_id
* workflow hash
* custom node git commit or file hash
* model file hashes
* processing duration per stage, if possible

That will make future tuning much easier.

## Be careful with cache busting

The random prefix solves ComfyUI cache problems, which is practical. But it also removes some chance of safe reuse. A better long-term version is a cache key derived from source hash + workflow hash + relevant params. Then you can intentionally reuse identical results.

# Best improvements to do next

If I were prioritizing, I would do these in order:

1. make ESRGAN conditional
2. store the CLIP confidence distribution and blend scene profiles
3. make the depth stage more conservative
4. preserve ICC/selected EXIF metadata
5. improve failure recovery and provenance logging

Those would give the best quality/reliability gains without rewriting the whole system.

# Bottom line

This is a well-designed, thoughtful workflow with a good systems mindset behind it. The main weakness is not architecture; it is over-processing risk from always-on enhancement stages, especially:

* full-image ESRGAN
* hard scene-class selection
* depth-based blur/sharpen

So my verdict is:

* architecture: strong
* operational design: strong
* quality strategy: promising, but should become more conditional and confidence-aware
* performance: acceptable for the workload
* next maturity step: adaptive routing and stronger provenance

If you want, I can turn this into a more formal engineering review with sections for "keep / change / risks / recommended roadmap," or I can propose a v2 pipeline that should improve both quality and speed.
yes:

# Photo Enhancement Workflow Review

## Executive summary

This workflow is well designed. It has a strong separation between orchestration, inference, deterministic image processing, and human review. The system is practical, understandable, and already close to a small production pipeline for batch photo enhancement.

The main issue is not the infrastructure or code shape. The main issue is **quality control under an always-on enhancement strategy**. Several expensive stages are applied to every image, even though their benefit is scene-dependent and sometimes negative.

The biggest gains now will come from making the workflow **conditional, confidence-aware, and slightly more conservative**.

---

# What should stay

## 1. Ruby as the control plane

This is a good choice. It gives you:

* clean batch orchestration
* simple manifest handling
* file lifecycle control
* easy VM lifecycle integration
* a place to keep business logic out of ComfyUI

## 2. ComfyUI as the execution graph

Also a good choice. It gives you:

* model reuse
* visual graph structure
* easy injection of runtime parameters
* modular custom node expansion

## 3. Metadata sidecar generation

This is one of the strongest parts of the system. The `_e.md` and JSON sidecars make the workflow:

* debuggable
* reviewable
* reproducible
* easier to tune later

## 4. Human review tool

The comparison tool is exactly the right final step. Enhancement pipelines often fail because they assume "processed" means "better." Yours does not.

## 5. EXIF orientation bake before upload

Correct and necessary. Good defensive engineering.

---

# What should change

## 1. Stop treating enhancement as a single fixed path

Right now the graph is elegant, but too uniform. The workflow should become a **decision tree**, not a single mandatory sequence. Some stages should be optional:

* Real-ESRGAN
* CodeFormer
* SkyEnhance
* DepthSelectiveSharpen
* grading strength inside AdaptivePhotoGrade

## 2. Make Real-ESRGAN conditional

This is the highest-priority change. Current risks:

* synthetic texture
* over-crisp foliage
* JPEG artifact amplification
* invented microdetail
* unnatural skin/hair

### Recommendation

Use ESRGAN only for:

* high-detail scenes (landscape, architecture)
* strong edge density
* visible softness or compression

Avoid or weaken it for:

* portraits
* night/high ISO
* already sharp JPEGs

## 3. Replace hard scene labels with blended grading

The current approach uses argmax from CLIP. The problem: scenes are often mixed.

### Recommendation

* keep the top 2–3 scene scores
* normalize
* blend profile parameters

Example:

* 0.55 landscape
* 0.35 golden_hour
* 0.10 overcast

Blend exposure, contrast, saturation, detail, and denoise.

## 4. Make depth processing more conservative

Default behavior should be:

* foreground sharpening only
* no background blur by default

Enable blur only when there is:

* strong subject separation
* portrait-like composition

## 5. Preserve metadata intentionally

The current pipeline likely loses:

* EXIF
* ICC profile

### Recommendation

Preserve or explicitly manage:

* capture timestamp
* camera/lens info
* ICC profile

and add processing metadata.

---

# Main risks

## Quality risks

### Over-processing

Stacked enhancements may lead to a synthetic look.

### Face inconsistency

CodeFormer may produce a mismatch with surrounding regions.

### Masking errors

Sky and depth masks may:

* misclassify regions
* create halos

## Operational risks

### Partial success ambiguity

Need stronger validation for:

* missing metadata
* partial downloads

### Weak provenance

Should log:

* prompt_id
* workflow hash
* model versions

### CPU bottleneck

Potential hotspots:

* large rescaling
* guided filtering
* morphology operations

---

# Performance review

## Current state

~40–50 s/photo is acceptable.

## Main optimization lever

Make ESRGAN conditional.

## Secondary lever

Skip unnecessary stages when not needed.

---

# Recommended v2 architecture

## Goal

Make the workflow adaptive.
## Pipeline

### Stage 0 — Preflight analysis

Compute:

* brightness histogram
* saturation
* edge density
* noise estimate
* face stats
* sky coverage
* CLIP scores

### Stage 1 — Policy selection

Decide:

* ESRGAN mode
* CodeFormer usage
* grading blend
* sky enhance on/off
* depth mode

### Stage 2 — Enhancement

Run only the selected stages.

### Stage 3 — Output + metadata

Include:

* policy decisions
* confidence scores
* timings

---

# Example metadata (v2)

```json
{
  "workflow_version": "photo-enhance-v2",
  "analysis": {
    "scene_scores": { "landscape": 0.51, "golden_hour": 0.28 },
    "face_count": 1,
    "sky_coverage_pct": 23.4
  },
  "policy": {
    "esrgan_mode": "weak",
    "depth_mode": "sharpen_only"
  }
}
```

---

# Roadmap

## Phase 1

* conditional ESRGAN
* blended scene grading
* disable background blur by default
* preserve metadata

## Phase 2

* preflight analysis
* gating logic for faces and sky
* improved logging

## Phase 3

* better color handling (true sRGB)
* noise-aware detail
* improved saturation logic

---

# Final verdict

## Strengths

* strong architecture
* practical workflow
* good separation of concerns

## Weakness

* over-processing risk from always-on stages

## Key improvement

Move from a fixed pipeline → an adaptive pipeline.

This will improve both quality and performance significantly.
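To make Stage 1 concrete, here is a minimal policy-selection sketch that turns preflight stats into the policy block from the example metadata. The field names mirror that JSON; every threshold and the `select_policy` name are illustrative assumptions to tune, not a prescribed implementation:

```python
def select_policy(analysis: dict) -> dict:
    """Map preflight statistics to per-stage decisions."""
    scores = analysis["scene_scores"]
    top = max(scores, key=scores.get)

    # ESRGAN: skip when noisy, weaken for faces, full only for detail-rich scenes
    if analysis.get("noise_estimate", 0.0) > 0.02:
        esrgan = "skip"
    elif analysis.get("face_count", 0) > 0:
        esrgan = "weak"
    elif top in ("landscape", "architecture") and analysis.get("edge_density", 0.0) > 0.05:
        esrgan = "full"
    else:
        esrgan = "weak"

    return {
        "esrgan_mode": esrgan,
        "codeformer": analysis.get("face_count", 0) > 0,
        "sky_enhance": analysis.get("sky_coverage_pct", 0.0) > 5.0,
        # Conservative default: sharpen foreground, blur only for portrait-dominant frames
        "depth_mode": "blur_and_sharpen" if scores.get("portrait", 0.0) > 0.5 else "sharpen_only",
    }

policy = select_policy({
    "scene_scores": {"landscape": 0.51, "golden_hour": 0.28},
    "face_count": 1,
    "sky_coverage_pct": 23.4,
    "edge_density": 0.12,
    "noise_estimate": 0.004,
})
```

Writing the returned dict into the sidecar JSON, alongside the analysis block, gives you the Stage 3 provenance for free.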