February 27, 2026 · 4 min read

How I Fine-Tuned F5-TTS for Romanian — and Why It Was Harder Than Expected


Romanian has almost no open-source TTS options. I spent three months fixing that — collecting data, hitting GPU walls, and eventually publishing the first F5-TTS model fine-tuned on Romanian speech. Here’s what actually happened.

Why Romanian TTS is a Problem

If you search for a decent Romanian text-to-speech model, you’ll find two things: commercial APIs with usage limits, and a handful of academic models that haven’t been updated since 2019. For a language with 24 million speakers, that’s a strange gap.

I needed a TTS solution for a client project — nothing fancy, just something that could generate natural-sounding Romanian voice narration for short video content. After two weeks of evaluating what existed, I decided it would be faster to fine-tune something myself than to work around the limitations of what was available.

Picking the Right Base Model

F5-TTS was the obvious choice. It’s built on a flow-matching architecture that handles prosody well for languages it wasn’t originally trained on, and it has a relatively clean fine-tuning path compared to alternatives like XTTS or Bark. The base model is trained on English, Chinese, and a few other languages — Romanian isn’t included, but the phoneme coverage overlaps enough to give the model a reasonable starting point.

The alternative was XTTS v2, which already supports Romanian. But in my tests XTTS produced robotic output on shorter sentences, and its voice cloning was inconsistent. F5-TTS with a proper fine-tune produced noticeably more natural results.

The Dataset Problem

This is where things got difficult. Fine-tuning TTS requires high-quality, single-speaker audio with accurate transcriptions. For English, you can pull from LibriSpeech or Common Voice and have tens of thousands of hours. For Romanian, the Common Voice dataset exists but the quality is inconsistent — crowd-sourced recordings, varied microphones, background noise.

I ended up building a custom dataset by:

  • Recording a native Romanian speaker across multiple sessions (approximately 4 hours total)
  • Using Whisper large-v3 to generate initial transcriptions, then manually correcting errors
  • Normalizing audio to -23 LUFS, trimming silence, splitting on sentence boundaries
  • Supplementing with cleaned segments from Romanian audiobooks in the public domain

Final dataset: around 6.5 hours of clean, aligned audio. That’s on the low end for TTS fine-tuning — most recommendations suggest 10+ hours — but F5-TTS is data-efficient enough that it worked.
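The trimming and level-normalization steps above can be sketched roughly like this. This is a simplified illustration in NumPy: true LUFS measurement needs a dedicated loudness meter (I used a standard -23 LUFS target), so the gain step here is an RMS-based stand-in, and the threshold value is arbitrary.

```python
import numpy as np

def trim_silence(audio: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """Drop leading and trailing samples below an amplitude threshold."""
    voiced = np.flatnonzero(np.abs(audio) > threshold)
    if voiced.size == 0:
        return audio[:0]
    return audio[voiced[0]:voiced[-1] + 1]

def normalize_rms(audio: np.ndarray, target_db: float = -23.0) -> np.ndarray:
    """Scale to a target RMS level in dBFS (a crude proxy for LUFS)."""
    rms = np.sqrt(np.mean(audio ** 2))
    if rms == 0:
        return audio
    return audio * (10 ** (target_db / 20) / rms)

# Toy example: one second of "speech" padded with silence on both sides
sr = 16_000
sig = np.concatenate([
    np.zeros(sr),
    0.5 * np.sin(np.linspace(0, 440 * 2 * np.pi, sr)),
    np.zeros(sr),
])
clean = normalize_rms(trim_silence(sig))
```

In the real pipeline each trimmed clip is then split on sentence boundaries so every training sample pairs one sentence of audio with one line of transcript.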

GPU Memory: The Wall I Kept Hitting

My local setup runs a single RTX 3090 (24GB VRAM). F5-TTS fine-tuning at batch size 16 with the full model fits, barely — but any mistake in the training script causes OOM errors that waste hours. I ended up moving the longer training runs to RunPod, using an A100 80GB pod on demand.

The hybrid approach: iterate and test locally, run full training jobs in the cloud. Total cloud spend for the project was under €40.
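One trick that keeps local iteration cheap while matching the cloud runs' hyperparameters is gradient accumulation: run whatever micro-batch fits in VRAM and step the optimizer only every N micro-batches, so the effective batch size stays constant across machines. A tiny helper for the arithmetic (the names are mine, not from the F5-TTS training scripts):

```python
import math

def accumulation_steps(target_batch: int, micro_batch: int) -> int:
    """Micro-batches to accumulate so micro_batch * steps >= target_batch."""
    if target_batch <= 0 or micro_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return math.ceil(target_batch / micro_batch)

# e.g. an effective batch of 16 when only 4 samples fit on the 3090
steps = accumulation_steps(16, 4)
```

The gradient math is equivalent to the larger batch; only wall-clock time per optimizer step changes.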

Training Details

The fine-tune ran for approximately 50,000 steps on the combined dataset. Key parameters that made a difference:

  • Lower learning rate than the default — 1e-5 instead of 1e-4 reduced catastrophic forgetting of the base model’s prosody patterns
  • Mixed precision (bf16) — stable on A100, occasionally problematic on 3090 where fp16 was safer
  • Checkpoint evaluation every 5,000 steps — the model hit a quality ceiling around step 35,000 and improvement after that was marginal
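The checkpoint-evaluation loop above boils down to tracking a quality score per checkpoint and stopping once the gains flatten. A minimal sketch of that stopping rule (the scores and threshold here are hypothetical; in practice I scored checkpoints with informal listening tests):

```python
def plateau_step(scores: dict[int, float], min_gain: float = 0.01) -> int:
    """Return the first checkpoint step whose successor gains less than min_gain."""
    steps = sorted(scores)
    for prev, cur in zip(steps, steps[1:]):
        if scores[cur] - scores[prev] < min_gain:
            return prev
    return steps[-1]

# Hypothetical per-checkpoint quality scores (higher is better)
scores = {5000: 0.55, 10000: 0.68, 15000: 0.76, 20000: 0.81,
          25000: 0.85, 30000: 0.88, 35000: 0.90, 40000: 0.905}
best = plateau_step(scores)
```

With numbers shaped like these, the rule picks step 35,000, matching where the real run hit its quality ceiling.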

Results and Publication

The final model handles Romanian speech with natural prosody, correct stress patterns, and consistent voice quality across different sentence types. It handles diacritics correctly (ă, â, î, ș, ț) — something that trips up generic multilingual models.

I published it on Hugging Face as the first publicly available F5-TTS model fine-tuned specifically for Romanian. The response from the Romanian developer community was better than expected — it filled a real gap.

If you’re working on a Romanian language project that needs TTS, the model is available on Hugging Face. It’s free to use under the same license as the base F5-TTS model.

Interested in working together? Get in touch →