Upload an audio or video file and get a clean text transcript back, plus timestamped subtitle files (SRT, VTT) and synced-lyrics files (LRC). It works on podcasts, interviews, lectures, meetings, voice memos, and the vocals of your own music. For music tracks there's a dedicated mode that isolates the vocals first, so the words come through far more clearly than they do when a full mix is fed to a transcriber.

Transcription runs on paid GPU time and is funded by a couple of short ads. You only watch ads for the portion of the file you choose to transcribe, not for the whole file.

How to use it

  1. Click the upload area or drag and drop an audio or video file (MP3, WAV, OGG, FLAC, M4A, WebM, MP4; up to 50 MB).
  2. Pick the mode. Speech / talking transcribes the file as-is (up to 10 minutes); Song / music isolates the vocals first (up to 6 minutes; the extra step costs a few more ads).
  3. If the file is longer than the per-run limit, drag the green and red markers to pick the section you want. The "−1 s / −10 s / +1 s / +10 s" buttons and Preview let you home in on it.
  4. Optionally set the spoken/sung language (or leave it on Auto-detect), tick "Translate the result to English", or open Advanced options to add a context hint (names, jargon, spelling) and toggle the low-confidence-line filter.
  5. Press Transcribe, watch the short ad(s), and your transcript appears.
  6. Toggle Show timestamps to switch between flowing text and a timestamped line list, use Copy to copy the text, or download it as .txt, .srt, .vtt, or .lrc. In Song mode you also get the isolated vocals to download or send to another tool.
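
For the curious, the marker logic in step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 10-minute and 6-minute limits come from step 2, but the function names, the "speech" / "song" keys, and the choice to keep the start marker fixed when a selection is too long are all hypothetical, not the tool's actual code.

```python
# Per-run limits from step 2. Every name below is hypothetical.
MODE_LIMIT_SECONDS = {"speech": 10 * 60, "song": 6 * 60}

def clamp_selection(start, end, duration, mode):
    """Return a (start, end) pair that fits inside the file and the mode's limit."""
    limit = MODE_LIMIT_SECONDS[mode]
    start = max(0.0, min(start, duration))
    end = max(start, min(end, duration))
    # Assumption: if the selection is too long, keep the start and pull the end in.
    if end - start > limit:
        end = start + limit
    return start, end

def nudge(start, end, duration, mode, marker, delta):
    """Move one marker ("start" or "end") by delta seconds, then re-clamp."""
    if marker == "start":
        start += delta
    else:
        end += delta
    return clamp_selection(start, end, duration, mode)
```

With this sketch, a 15-minute file in Speech mode with both markers at the extremes would be trimmed to the first 10 minutes, and `nudge(30, 90, 900, "song", "end", 10)` would move the end marker from 90 s to 100 s.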

FAQ

What's the difference between Speech mode and Song mode? Speech mode feeds your selection straight to the speech-to-text model. It's best for talking: podcasts, interviews, lectures, voice notes. Song mode first separates the vocal track from the music and transcribes only the vocals, which gives much cleaner results on full songs. Song mode adds an extra GPU step, so it's capped at a shorter length (6 minutes instead of 10) and uses a few more short ads.

Which output formats do I get? A plain-text transcript (.txt), SubRip subtitles (.srt), WebVTT subtitles (.vtt), and an LRC file (.lrc) for synced lyrics. They're all built from the same timestamped result, so you can use whichever your video editor, player, or karaoke app expects.
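
To show how one timestamped result can feed all four formats, here is a minimal Python sketch. It assumes the result is a list of segments with `start` and `end` times in seconds and a `text` string; that shape and all function names are assumptions for illustration, not the tool's internals. The timestamp layouts themselves follow the usual conventions: SRT uses a comma before the milliseconds, WebVTT uses a dot, and LRC uses `[mm:ss.xx]` with hundredths of a second.

```python
def _split(seconds):
    """Split a time in seconds into (hours, minutes, seconds, milliseconds)."""
    total_ms = int(round(seconds * 1000))
    total_s, ms = divmod(total_ms, 1000)
    total_m, s = divmod(total_s, 60)
    h, m = divmod(total_m, 60)
    return h, m, s, ms

def srt_time(seconds):
    h, m, s, ms = _split(seconds)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"  # SRT: comma before milliseconds

def vtt_time(seconds):
    h, m, s, ms = _split(seconds)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"  # WebVTT: dot before milliseconds

def lrc_time(seconds):
    h, m, s, ms = _split(seconds)
    # LRC: total minutes, seconds, hundredths (truncated from milliseconds)
    return f"[{h * 60 + m:02d}:{s:02d}.{ms // 10:02d}]"

def to_srt(segments):
    blocks = []
    for i, seg in enumerate(segments, start=1):  # SRT cues are numbered from 1
        times = f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}"
        blocks.append(f"{i}\n{times}\n{seg['text']}\n")
    return "\n".join(blocks)

def to_vtt(segments):
    cues = []
    for seg in segments:
        times = f"{vtt_time(seg['start'])} --> {vtt_time(seg['end'])}"
        cues.append(f"{times}\n{seg['text']}\n")
    return "WEBVTT\n\n" + "\n".join(cues)  # a WebVTT file starts with this header

def to_lrc(segments):
    # LRC only records when each line starts, not when it ends.
    return "\n".join(f"{lrc_time(seg['start'])}{seg['text']}" for seg in segments) + "\n"

def to_txt(segments):
    return " ".join(seg["text"] for seg in segments)
```

Because every converter reads the same segment list, the four downloads always agree on wording and timing; only the layout differs.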

How accurate is it, and what affects accuracy? It uses a state-of-the-art Whisper model. Clear recordings, single speakers, and common languages transcribe best. Heavy background noise, overlapping speakers, strong accents, or low-bitrate audio reduce accuracy. For songs, use Song mode. Adding a context hint in Advanced options (names, technical terms, expected spelling) can noticeably improve how proper nouns are transcribed.

What happens if the audio has no talking or singing? The tool detects that and tells you "No speech or vocals were detected" instead of returning made-up text. The "Drop low-confidence / non-speech lines" option (on by default) also removes the spurious lines that speech models sometimes produce over silence or pure instrumental passages.
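
A filter like this can be sketched roughly as follows. The field names `no_speech_prob` and `avg_logprob` are borrowed from the segment dictionaries returned by the open-source Whisper package; whether this tool relies on those exact signals, the threshold values, and the function names are all assumptions made for illustration.

```python
NO_SPEECH_MESSAGE = "No speech or vocals were detected"

def drop_low_confidence(segments, max_no_speech_prob=0.6, min_avg_logprob=-1.0):
    """Keep only segments that look like real speech (thresholds are illustrative)."""
    kept = []
    for seg in segments:
        if not seg["text"].strip():
            continue  # empty line, nothing to keep
        likely_silence = seg["no_speech_prob"] > max_no_speech_prob
        low_confidence = seg["avg_logprob"] < min_avg_logprob
        # Assumption: either signal alone is enough to drop the line.
        if likely_silence or low_confidence:
            continue
        kept.append(seg)
    return kept

def result_or_message(segments):
    """Return the filtered transcript, or the message if nothing survives."""
    kept = drop_low_confidence(segments)
    if not kept:
        return NO_SPEECH_MESSAGE
    return " ".join(seg["text"].strip() for seg in kept)
```

In this sketch, a line the model produced over an instrumental passage with a high no-speech probability is dropped, and if every line is dropped you get the message instead of an empty or invented transcript.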

Can it detect the language? Can it translate? Yes. Leave the language on Auto-detect and it'll figure it out (the detected language is shown with the result). You can also pick the language manually, and tick "Translate the result to English" to get an English version alongside the original.
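
As a rough illustration of how these settings could map onto a Whisper run, the sketch below turns the form choices into per-pass options. The keyword names `language`, `task`, and `initial_prompt` match options accepted by the open-source Whisper package's `transcribe` function; whether this tool calls that package, the two-pass approach for translation, and the function name `build_passes` are assumptions.

```python
def build_passes(language="auto", translate=False, hint=""):
    """Return one options dict per model pass (hypothetical helper)."""
    base = {}
    if language != "auto":
        base["language"] = language  # omit the key to let the model auto-detect
    if hint.strip():
        base["initial_prompt"] = hint.strip()  # the context hint from Advanced options
    passes = [dict(base, task="transcribe")]  # transcript in the original language
    if translate:
        passes.append(dict(base, task="translate"))  # English version alongside it
    return passes
```

Leaving everything on its default yields a single auto-detecting transcription pass; ticking the translate box adds a second pass with the same language and hint.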

Why are there ads, and is there a daily limit? Transcription (and vocal isolation for songs) runs on rented GPU time, which costs real money. One short ad for every few minutes of your selection keeps the tool free. To prevent abuse there's a cap on how much you can transcribe per day; if you hit it, you'll see a message and can come back later.

Do you keep my audio or my transcript? No. Your file is processed for this request only and isn't stored long-term. Your transcript is returned to you and isn't published, indexed, or added to any database. Please only upload audio you have the rights to transcribe.