Ask HN: How to automate aesthetic photo cropping? (CV/AI)
Monday, January 12, 2026
Hi everyone,
I am a backend developer currently engineering an in-house automation tool for a K-pop merchandise production company (photocards, postcards, etc.).
I have built an MVP using Python (FastAPI) + Libvips + InsightFace to automate the process where designers previously had to manually crop thousands of high-resolution photos using Illustrator.
While basic face detection and image quality preservation (CMYK conversion, etc.) work well, I am hitting a bottleneck in automating the "Designer's Sense" (vibe/aesthetics).
[Current Stack & Workflow]
Tech Stack: Python 3.11, FastAPI, Libvips (Processing), InsightFace (Landmark Detection).
Workflow: Bulk Upload → Landmark Extraction (InsightFace) → Auto-crop based on pre-defined ratios → Human-in-the-loop fine-tuning via Web UI.
[The Challenges]
Mechanical Logic vs. Aesthetic Crop
Simple centering logic fails to capture the "perfect shot" for K-pop idols who often have dynamic poses or varying camera angles.
Issue: Even if the landmarks are mathematically centered, the resulting headroom is often inconsistent, or the chin is awkwardly cut off. The output lacks visual stability compared to a human designer's work.
Need for Reference-Based One-Shot Style Transfer
Clients often provide a single "Guide Image" and ask, "Crop the rest of the 5,000 photos with this specific feel." (e.g., a tight face-filling close-up vs. a spacious upper-body shot).
Goal: Instead of designers manually guessing the ratio, I want the AI to reverse-engineer the composition (face-to-canvas ratio, relative position) from that one sample image and apply it dynamically to the rest of the batch.
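To make the goal concrete, here is a minimal sketch of the kind of composition transfer described above. It is illustrative only, not the tool's actual code: the names `Composition`, `extract_composition`, and `crop_for` are hypothetical, eye coordinates are assumed to come from the landmark step as (x, y) pixel pairs, and inter-eye distance is used as a stand-in for face size (an assumption that degrades when the head is turned).

```python
import math
from dataclasses import dataclass

Point = tuple[float, float]


@dataclass(frozen=True)
class Composition:
    """Composition of a reference crop, expressed relative to its canvas."""
    eye_x: float     # eye midpoint x, as a fraction of canvas width
    eye_y: float     # eye midpoint y, as a fraction of canvas height
    eye_dist: float  # inter-eye distance, as a fraction of canvas width
    aspect: float    # canvas width / canvas height


def extract_composition(left_eye: Point, right_eye: Point,
                        width: float, height: float) -> Composition:
    """Measure where the eyes sit in the reference image and how large they are."""
    mid_x = (left_eye[0] + right_eye[0]) / 2
    mid_y = (left_eye[1] + right_eye[1]) / 2
    dist = math.dist(left_eye, right_eye)
    return Composition(mid_x / width, mid_y / height, dist / width, width / height)


def crop_for(comp: Composition, left_eye: Point, right_eye: Point):
    """Return (left, top, width, height) of a crop box in target pixel coordinates.

    The box reproduces the reference's eye position and scale. It may extend
    past the image edges; clamping or padding is left to the caller.
    """
    mid_x = (left_eye[0] + right_eye[0]) / 2
    mid_y = (left_eye[1] + right_eye[1]) / 2
    dist = math.dist(left_eye, right_eye)
    crop_w = dist / comp.eye_dist   # scale so the eyes span the same fraction
    crop_h = crop_w / comp.aspect   # keep the reference aspect ratio
    left = mid_x - comp.eye_x * crop_w
    top = mid_y - comp.eye_y * crop_h
    return left, top, crop_w, crop_h


# Reference: 1000x1500 canvas, eyes on a line 30% from the top.
ref = extract_composition((400, 450), (600, 450), 1000, 1500)
box = crop_for(ref, (1900, 1000), (2300, 1000))
```

Two eye points only capture position and scale. Head tilt, yaw, and body pose are ignored here, which is exactly where I suspect a purely geometric approach falls short.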
[Questions]
Q1. Direction for Improving Aesthetic Composition
Is it more practical to refine rule-based heuristics (e.g., pinning the eye line 30% from the top of the frame, with extra conditionals for edge cases), or should I look into "Aesthetic Quality Assessment (AQA)" or "Saliency Detection" models to score and select the best crop?
As of 2026, what is the most efficient, production-ready approach for this?
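For reference, the two options in Q1 can share one skeleton: generate candidate crops from a rule, then pick the best one with a scoring function. The sketch below is a hypothetical illustration under stated assumptions. `margin_score` is a hand-written stand-in that only checks headroom and chin margin, not a real aesthetic model, and the head-top and chin y-coordinates are assumed to be available from the detector or estimated from landmarks.

```python
import math
from typing import Callable, Iterable, Iterator

Box = tuple[float, float, float, float]  # (left, top, width, height) in pixels


def eye_line_crop(eye_mid: tuple[float, float], face_h: float, aspect: float,
                  eye_frac: float, face_frac: float) -> Box:
    """Rule-based crop: eyes at `eye_frac` of crop height, face filling `face_frac`."""
    crop_h = face_h / face_frac
    crop_w = crop_h * aspect
    left = eye_mid[0] - crop_w / 2          # horizontally centered on the eyes
    top = eye_mid[1] - eye_frac * crop_h    # eye line pinned at eye_frac
    return (left, top, crop_w, crop_h)


def candidate_crops(eye_mid: tuple[float, float], face_h: float,
                    aspect: float) -> Iterator[Box]:
    """Sweep a small grid of eye-line positions and face sizes."""
    for eye_frac in (0.25, 0.30, 0.35):
        for face_frac in (0.30, 0.40, 0.50):
            yield eye_line_crop(eye_mid, face_h, aspect, eye_frac, face_frac)


def margin_score(box: Box, head_top_y: float, chin_y: float,
                 target_headroom: float = 0.10, min_margin: float = 0.05) -> float:
    """Stand-in scorer: reject cut-off heads/chins, prefer a target headroom."""
    _, top, _, h = box
    headroom = (head_top_y - top) / h        # space above the head, as a fraction
    chin_room = (top + h - chin_y) / h       # space below the chin, as a fraction
    if headroom < min_margin or chin_room < min_margin:
        return -math.inf
    return -abs(headroom - target_headroom)


def best_crop(cands: Iterable[Box], score: Callable[[Box], float]) -> Box:
    """Select the highest-scoring candidate; `score` could wrap an AQA model."""
    return max(cands, key=score)


cands = list(candidate_crops((1000.0, 800.0), face_h=400.0, aspect=2 / 3))
best = best_crop(cands, lambda b: margin_score(b, head_top_y=600.0, chin_y=1000.0))
```

Swapping `margin_score` for a learned aesthetic or saliency score would leave the rest of the pipeline unchanged, which is why I am asking which side of that interface deserves the engineering effort.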
Q2. One-Shot Composition Transfer
Are there any known algorithms or libraries that can extract the "compositional style" (the position of the eyes/nose/mouth relative to the canvas frame) from a single reference image and apply it to target images?
I am looking for keywords or papers related to "One-shot learning for layout/composition" or "Content-aware cropping based on reference."
Any keywords, papers, or architectural advice from those who have tackled similar problems in production would be greatly appreciated.
Thanks in advance.