"SIRENA"
"Sirena" AI music video by Mark DK Berry.
WORKFLOWS & PROJECT DETAILS
🎥 Description
"In Latin, 'sirena' translates to siren. The word originates from the Greek word Σειρήν (Seirḗn), which refers to the mythical creatures known for their enchanting voices and ability to lure sailors to their doom."
🧜‍♀️ About the Project
"Sirena" was my seventh AI music video, and for this one I deliberately stepped out of my comfort zone to tackle something different: an underwater romance.
My main goal was to improve image and animation quality across the board. Unfortunately, despite giving myself extra time and effort, I didn’t quite reach the level I’d hoped for. While hardware played a part, character consistency and learning all the workflow settings were the real bottlenecks.
⚠️ Key Challenges
- Character Consistency
  A nightmare. I trained Flux Loras, which were decent, but unless it’s a full-face image, they don’t stay consistent in video creation.
  Face swapping? A black art. Gave up after days of fiddling.
  Wan 2.1 Lora training: Took two days to get working and produced poor results, and it slowed my machine so much it became unusable. I'm holding out hope for keyframing and video inpainting (like VACE) as future solutions.
- Legs, Hair, Body Shape
  Characters often morphed or warped, even from behind. Three shots had bad leg warping that I didn’t have time to fix.
- Clothing Consistency
  I put real effort into this, but many outfits didn’t stick once they became video.
- Image Tiling Artifacts
  Only spotted these late in production. The culprit? Flux or SDXL during upscaling. I patched some of it using blur tools in Krita, but didn’t have time to redo every clip.
- Psychological Toll
  Once I got past the 8–10 day mark, I was chasing diminishing returns. At 18 days, I was questioning whether I was going insane tweaking pixels no one would notice or care about.
⏱️ Time & Energy Investment
→ Total: 18 working days (and some nights running batch renders, and a number of additional days installing, re-installing, fixing broken installs, workflows, nodes, etc.)
I didn't properly track electricity use this time, but I estimate it was around 40 kWh. I will track it in the future. There may come a point where it is cheaper and faster to rent powerful servers and batch-process everything than to run a GPU locally for days and nights, burning kWh.
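A rough sketch of that rent-versus-local comparison, in Python. The 40 kWh figure is the estimate above; the electricity price and the hourly rental rate are placeholder assumptions, not real quotes. The sketch also ignores hardware wear and how much faster a rented GPU might be.

```python
def local_energy_cost(kwh: float, price_per_kwh: float) -> float:
    """Electricity cost of a local run, in the same currency as the price."""
    return kwh * price_per_kwh


def equivalent_rental_hours(kwh: float, price_per_kwh: float,
                            rental_per_hour: float) -> float:
    """Rented GPU hours that cost the same as the local electricity."""
    return local_energy_cost(kwh, price_per_kwh) / rental_per_hour


ENERGY_KWH = 40.0        # rough estimate from the write-up
PRICE_PER_KWH = 0.30     # ASSUMPTION: placeholder electricity price
RENTAL_PER_HOUR = 1.50   # ASSUMPTION: placeholder cloud GPU hourly rate

cost = local_energy_cost(ENERGY_KWH, PRICE_PER_KWH)
hours = equivalent_rental_hours(ENERGY_KWH, PRICE_PER_KWH, RENTAL_PER_HOUR)
print(f"Local electricity: {cost:.2f}")
print(f"Same money buys about {hours:.1f} rented GPU hours")
```

With these placeholder numbers, renting only wins on cost if the whole batch finishes in fewer rented hours than that break-even figure. Swap in real prices before drawing any conclusion.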
💻 Hardware
All of it was done on a regular home PC.
🧰 Software Stack
- ComfyUI (Flux, Wan 2.1, inpainting models)
- Krita + ACLY plugin – Fast inpainting and upscaling
- Topaz – Used only for 16fps to 120fps interpolation (not enhancement)
- Reaper DAW – Storyboarding with shot names and timecode burned into MP4
- DaVinci Resolve 19 – Final cut and colour grade
- LibreOffice – Tracking shot names, prompts, colour themes, fixes, etc.
🎨 Loras Used & Trained
- Trained Flux Loras for both the fisherman and mermaid (10 images each; ~3 hours per Lora)
- Trained a Wan 2.1 Lora on WSL2 with help from The Art Official, but it was ultimately not usable on my setup. It should work, but it is complex; I will have to research it better for next time.
- Flux Loras from Civitai used in the video (see the video for their "looks"):
📺 Resolution & Rendering Details
I need to work on this for the next project; it didn't go according to plan.
- Flux output: 1344x768 (upscaled x2, then downscaled in Wan 2.1)
- Tested Wan 2.1 resolutions extensively:
- Interpolated all clips to 120fps via Topaz (in retrospect it was pointless; the free version of DaVinci Resolve allows up to 60fps)
- Tried to bring back an "analog" vibe in the final colour grade in DaVinci Resolve
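The resolution chain and the interpolation step above come down to simple arithmetic, sketched here in Python. The 1344x768 source size, the x2 upscale, and the 16fps to 120fps interpolation are from the list above. The 832-pixel target width and the snap-to-16 rule are assumptions for illustration only, not the settings used in the video.

```python
from fractions import Fraction


def scale(size, factor):
    """Scale a (width, height) pair by a factor."""
    w, h = size
    return (round(w * factor), round(h * factor))


def fit_to_width(size, target_width, multiple=16):
    """Downscale to target_width, keeping the aspect ratio as closely as
    possible while snapping the height to a multiple (ASSUMPTION: 16)."""
    w, h = size
    exact_height = target_width * h / w
    snapped = max(multiple, round(exact_height / multiple) * multiple)
    return (target_width, snapped)


def interpolation_factor(src_fps, dst_fps):
    """Output frames produced per source frame, as an exact fraction."""
    return Fraction(dst_fps, src_fps)


flux = (1344, 768)
upscaled = scale(flux, 2)              # (2688, 1536)
video = fit_to_width(upscaled, 832)    # ASSUMPTION: hypothetical target width

print(upscaled, video)
print(interpolation_factor(16, 120))   # 15/2, i.e. 7.5 output frames per source frame
print(interpolation_factor(16, 60))    # 15/4, i.e. 3.75 output frames per source frame
```

Neither factor is a whole number, so some output frames fall between source frames whichever target frame rate is chosen.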
😵‍💫 Final Thoughts
The biggest hurdle was still character consistency. I trained Loras, tested face-swapping, tried everything, but nothing quite nailed it. Underwater scenes and low-res footage made things harder.
Prompting and camera direction were another headache. Wan 2.1 is better than Hunyuan, but not exactly "obedient." I tried short prompts, long prompts, and "3-sentence" tricks, all with mixed results.
By the end, I was feeling frustrated. I had hoped for more photorealism and tighter characters. Instead, the video still felt cartoonish (though that was partly intentional). I haven't fully mastered this — or maybe the tech just isn't quite there yet.
There were many challenges and frustrations; by the end, it was more about getting it finished and learning from the experience.