Speech Recognition

Warning

Speech recognition does not work in version 21.04.2 because of problems with the Vosk API. Use version 21.04.1, or version 21.04.3 or later.

Before you can use Speech to Text, it must be configured and the speech models must be installed. Refer to the chapter Configure Speech to Text.

Hint

You can configure and set up both VOSK and Whisper for speech recognition. The engine selected in the Speech to Text configuration section is the one used the next time you run speech recognition. You can switch back and forth during editing and use different engines for different purposes. The Speech Editor widget has a menu entry that opens the configuration section directly, bypassing the Menu ‣ Settings ‣ Configure Kdenlive ‣ Speech to Text route.

Speech Recognition

There are two use cases for speech recognition:

  1. Creating subtitles automatically

  2. Creating transcripts and the ability to add clips to the timeline based on the transcript

Creating Subtitles using VOSK Speech Recognition

If not yet created, add a subtitle track by clicking on the Edit Subtitle Tool icon in the Timeline tool bar (6).

Speech to text subtitle

Automatic subtitle generation using the VOSK engine

1:

Speech recognition. Click here to open the Automatic Subtitling dialog window.

2:

Timeline Zone. More details about Timeline Zones can be found in the chapter Timeline Ruler.

3:

Choose which part of the timeline should be used for speech recognition.

4:

Process. Click to start the recognition.

5:

Model. Select the model for the language of the subtitles. You can install more models in the Configuration section Speech to Text.

6:

Edit Subtitle Tool. Click to open or close the subtitle track.

Steps to create subtitles using VOSK speech recognition

(numbers in brackets point to the GUI element in the screenshot above):

  1. Speech recognition (1). Click here to open the Automatic Subtitling dialog window.

  2. If needed, define a timeline zone (2) for which you want to use speech recognition. More details about Timeline Zones can be found in the chapter Timeline Ruler.

  3. Model (5). Select the model for the language of the subtitles. You can install more models in the Configuration section Speech to Text.

  4. Choose which part of the timeline should be used for speech recognition (3).

  5. Process (4). Click to start the subtitle creation.

The subtitle is created and inserted automatically.

Remark on step 4: By default, only the Timeline zone (all tracks) is analyzed (2 in the screenshot above). Set the timeline zone to the part you want to analyze (use I and O to set the in and out points). The Selected clips option analyzes only the selected clips.

Creating Subtitles using WHISPER Speech Recognition

If not yet created, add a subtitle track by clicking on the Edit Subtitle Tool icon in the Timeline tool bar (11).

Speech to text subtitle Whisper

Automatic subtitle generation using the Whisper engine

1:

Speech recognition. Click here to open the Automatic Subtitling dialog window.

2:

Timeline Zone. More details about Timeline Zones can be found in the chapter Timeline Ruler.

3:

Choose which part of the timeline should be used for speech recognition.

4:

Model. Select the model for the language of the subtitles. You can install more models in the Configuration section Speech to Text.

5:

Process. Click to start the recognition.

6:

Language. Default is Autodetect. Change to the correct language if not detected properly.

7:

Maximum character per line. Define how many characters per line are allowed before a line break is inserted.

8:

Translate with SeamlessM4T. Checking this adds two more selection fields: one for the Input language and one for the Output language. This requires that translation with SeamlessM4T is enabled in the settings (Menu ‣ Settings ‣ Configure Kdenlive ‣ Speech To Text). Refer to the chapter about Speech to Text.

9:

Translate to English. Select this to use Whisper for the translation to English.

10:

Edit Subtitle Tool. Click to open or close the subtitle track.
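To illustrate what a per-line character limit such as Maximum character per line (7) means in practice, here is a minimal Python sketch. It is not Kdenlive's implementation; the function name `wrap_subtitle` is hypothetical, and the sketch assumes line breaks are placed at spaces so that no line exceeds the limit.

```python
import textwrap


def wrap_subtitle(text: str, max_chars: int) -> str:
    """Break subtitle text into lines of at most max_chars characters.

    Illustration only: lines are split at spaces where possible,
    using Python's standard textwrap module.
    """
    return "\n".join(textwrap.wrap(text, width=max_chars))


if __name__ == "__main__":
    # Hypothetical subtitle text wrapped to a 20-character limit.
    print(wrap_subtitle("Speech recognition can create subtitles automatically", 20))
```

A smaller limit produces more, shorter lines; a larger limit keeps more words on each line.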

Steps to create subtitles using Whisper speech recognition

(numbers in brackets point to the GUI element in the screenshot above):

  1. Speech recognition (1). Click here to open the Automatic Subtitling dialog window.

  2. If needed, define a timeline zone (2) for which you want to use speech recognition. More details about Timeline Zones can be found in the chapter Timeline Ruler.

  3. Model (4). Select the model for the language of the subtitles. You can install more models in the Configuration section Speech to Text.

  4. Choose which part of the timeline should be used for speech recognition (3).

  5. Process (5). Click to start the subtitle creation.

The subtitle is created and inserted automatically.

Remark on step 4: By default, only the Timeline zone (all tracks) is analyzed (2 in the screenshot above). Set the timeline zone to the part you want to analyze (use I and O to set the in and out points). The Selected clips option analyzes only the selected clips.

Translate with SeamlessM4T

Whisper SeamlessM4T: Choose input and output language

Translating with SeamlessM4T

Select Input Language and Output Language and click Process.

This first processes the audio using Whisper, then starts the SeamlessM4T translation. Translation can use up to 100% of RAM, CPU and disk access.

Attention

If the model (about 9 GB) has not been downloaded yet, it is downloaded now. At a download speed of 100 Mbit/s this takes about 12 minutes.

Kdenlive remains responsive during the download. Do not click Close, otherwise the download is stopped.

Whisper SeamlessM4T choose input and output language

Don’t worry if the box below shows a message such as Initializing translation model while the download is running.

Depending on your internet connection and bandwidth, downloading the model can take quite some time (about 12 minutes at a download speed of 100 Mbit/s).
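The 12-minute estimate matches a model of about 9 GB on a link of roughly 100 Mbit/s (about 12.5 MB/s). The following sketch shows the arithmetic so you can estimate the time for your own connection. The function name is hypothetical, and it assumes decimal units (1 GB = 1000 MB) and ignores protocol overhead.

```python
def download_minutes(size_gb: float, speed_mbit_per_s: float) -> float:
    """Estimate download time in minutes.

    Assumes decimal units (1 GB = 1000 MB, 1 byte = 8 bits)
    and a constant transfer rate without overhead.
    """
    size_megabits = size_gb * 1000 * 8
    seconds = size_megabits / speed_mbit_per_s
    return seconds / 60


if __name__ == "__main__":
    print(download_minutes(9, 100))  # 12.0 minutes at 100 Mbit/s
    print(download_minutes(9, 800))  # 1.5 minutes at 800 Mbit/s (100 MB/s)
```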

Once the translation model is downloaded, translation will start.

Creating Clips using Speech Recognition

This is useful for interviews and other speech-related footage. Go to the Speech Editor widget. If not yet enabled, do so via Menu ‣ View ‣ Speech Editor.

Note

Creating transcripts with speech recognition, and creating clips from them, is only possible with clips in the Project Bin.

Speech editor

Shown with the VOSK engine and Search (10) enabled

Select a clip in the Project Bin.

1:

If needed, set in and out points in the Clip Monitor and check Selected zone only. This will only transcribe text inside that zone.

2:

Click on Hamburger Menu and choose the model for the correct language when the VOSK engine is set for speech recognition. If the Whisper engine is selected, you can select Translate to English if needed. You select the speech recognition engine in Menu ‣ Settings ‣ Configure Kdenlive ‣ Speech to Text. Click on Configure Speech Recognition to open the configuration section for Speech to Text. For more details about the configuration refer to the chapter Configure Speech to Text.

3:

Press the Transcribe button.

4:

Select the text you want. Hold down Ctrl or Shift to select more than one entry.

5:

Create new sequence with edit creates a new sequence with each timecode-text as a single clip. Insert selection in timeline creates clips for each selected timecode-text starting at the playhead’s position. Save edited text in a playlist file creates an asset in the project bin with the entire transcribed text.

6:

Increase font size and Decrease font size increase and decrease the font size, respectively.

7:

Add marker adds a marker/guide for the timecode of the selected text. More details about Guides and Markers are available in the chapter about Guides.

8:

Delete selection deletes the selected text.

9:

Remove non speech zones deletes all "No speech" entries at once.

10:

Search in text toggles the search field. Enter the text you want to find in the transcribed text. The search is not case sensitive and finds all occurrences of the string, even within words. The up and down arrow buttons navigate between occurrences of the search term. If the search field turns reddish, you have reached the last occurrence of the search term in the text.
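The search behavior described in item 10 (not case sensitive, matching even within words) can be sketched as follows. This is an illustration only, not Kdenlive's code; `find_occurrences` is a hypothetical helper and the sample sentence is made up.

```python
def find_occurrences(text: str, term: str) -> list[int]:
    """Return the start index of every case-insensitive match of term in text.

    Matches inside longer words are included, because this is a plain
    substring search rather than a whole-word search.
    """
    if not term:
        return []
    haystack = text.lower()
    needle = term.lower()
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Continue one character further so overlapping matches are found too.
        start = haystack.find(needle, start + 1)
    return positions


if __name__ == "__main__":
    # "cat" matches the word "cat" and the start of "CATHEDRAL".
    print(find_occurrences("The cat sat on the CATHEDRAL steps", "cat"))  # [4, 19]
```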

Silence Detection

Note

This works only with the VOSK engine.

Select the clip in the Project Bin and open the Speech Editor window (Menu ‣ View ‣ Speech Editor).

Click on Hamburger Menu and choose the model for your language. If the right model is not listed, click on Configure Speech Recognition. For details about how to add models for the VOSK engine refer to the chapter about Speech to Text.

Then click the Start Recognition button.

Once recognition has finished, go to item 9 above and choose Remove non speech zones, which removes them all at once. Alternatively, click on the timecode of an entry marked "No speech" (hold down Ctrl to select more than one at a time), then press the Delete key.

Repeat this for every part you want to remove, including parts where someone says something you do not want in the final edit.

When you are done, make sure Selected zone only is disabled, then click Save edited text in a playlist file (at the end of item 5). After a few seconds a new playlist is added to the Project Bin, without the silences and without the text you removed.
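Conceptually, removing the silent parts means dropping every transcript entry marked "No speech" and keeping the rest. The sketch below illustrates that idea under simple assumptions; it does not reflect how Kdenlive stores transcripts. The `Segment` type, the function names and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # recognized text, or the "No speech" marker


NO_SPEECH = "No speech"  # marker text assumed for silent entries


def remove_non_speech(segments: list[Segment]) -> list[Segment]:
    """Keep only the segments that contain recognized speech."""
    return [s for s in segments if s.text != NO_SPEECH]


def total_duration(segments: list[Segment]) -> float:
    """Sum the durations of the given segments in seconds."""
    return sum(s.end - s.start for s in segments)


if __name__ == "__main__":
    transcript = [
        Segment(0.0, 2.0, NO_SPEECH),
        Segment(2.0, 5.0, "Hello and welcome"),
        Segment(5.0, 6.0, NO_SPEECH),
        Segment(6.0, 10.0, "to this interview"),
    ]
    kept = remove_non_speech(transcript)
    print([s.text for s in kept])
    print(total_duration(kept))  # 7.0 seconds of speech remain
```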