Speech Note: App for Offline Speech to Text, TTS, and Translation

I’ve always dreamed of talking to computers. Cloud tools made it possible, but nothing beats doing it right on your own machine. The first tool I encountered that ran well locally was Dragon NaturallySpeaking. It worked pretty well but was very expensive and closed source.

In recent years, a few models have appeared that run locally and do well. OpenAI Whisper took it to the next level. However, it wouldn’t run on my LC230. Then came derivatives like whisper.cpp, which offered tiny models with modest accuracy even on my LC230. That changed drastically with the arrival of a new laptop that can run medium-sized models with reasonably good accuracy.

Speech Note: Quick Access from system tray.

So now I run Speech Note (dsnote) all the time. Speech Note is a Linux app for writing, reading, and translating with offline speech-to-text, text-to-speech, and machine translation models. It supports hundreds of models that you can download and run from within the app. It’s a well-packaged, all-in-one app.

Speech to Text

Speech Note: All available English STT models.

It has many models available (including any of your custom models). Since I run it on a laptop without a dedicated GPU, picking a model always means balancing GPU load, speed, and quality. Here is my comparison table.

Model                       Size    Optimized For     Indian Accent Support  Speed      Accuracy
FasterWhisper Medium        Medium  Speed + Accuracy  General                Fast       High
FasterWhisper Small         Small   Speed             General                Very Fast  Medium
FasterWhisper Tiny          Tiny    Speed             General                Fastest    Low
Coqui Huge                  Huge    Accuracy          Unknown                Slow       High
Vosk Large (Indian)         Large   Indian Accent     Yes                    Medium     High
Vosk Small (Indian)         Small   Indian Accent     Yes                    Fast       Medium
WhisperCpp Base             Base    General           Limited                Medium     Medium
WhisperCpp-Distil Large-v2  Large   Balanced          Moderate               Fast       High
WhisperCpp-Distil Large-v3  Large   Balanced          Moderate               Fast       High
WhisperCpp-Distil Medium    Medium  Speed             Moderate               Fast       Medium
WhisperCpp-Distil Small     Small   Speed             Moderate               Very Fast  Low
WhisperCpp Large-v3         Large   Accuracy          Moderate               Slow       High
WhisperCpp Large-v2         Large   Accuracy          Moderate               Slow       High
WhisperCpp Large-v3 Turbo   Large   High Accuracy     Moderate               Medium     Very High
WhisperCpp Medium           Medium  Balanced          Moderate               Fast       Medium
WhisperCpp Small            Small   Speed             Moderate               Very Fast  Low
WhisperCpp Tiny             Tiny    Speed             Moderate               Fastest    Very Low
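
If you prefer to shortlist models programmatically, the trade-off in the table above can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the table, but the numeric ranking of the labels and the `shortlist` helper are my own assumptions, not something Speech Note provides.

```python
from dataclasses import dataclass

# Assumed ordering of the table's labels, worst to best.
# These ranks are illustrative only; Speech Note does not report them.
SPEED_RANK = {"Slow": 0, "Medium": 1, "Fast": 2, "Very Fast": 3, "Fastest": 4}
ACCURACY_RANK = {"Very Low": 0, "Low": 1, "Medium": 2, "High": 3, "Very High": 4}


@dataclass(frozen=True)
class SttModel:
    name: str
    speed: str
    accuracy: str


# A subset of rows copied from the table above.
MODELS = [
    SttModel("FasterWhisper Medium", "Fast", "High"),
    SttModel("FasterWhisper Small", "Very Fast", "Medium"),
    SttModel("FasterWhisper Tiny", "Fastest", "Low"),
    SttModel("Vosk Large (Indian)", "Medium", "High"),
    SttModel("WhisperCpp Large-v3", "Slow", "High"),
    SttModel("WhisperCpp Large-v3 Turbo", "Medium", "Very High"),
    SttModel("WhisperCpp Tiny", "Fastest", "Very Low"),
]


def shortlist(models, min_speed, min_accuracy):
    """Return names of models at least as fast and as accurate as requested."""
    return [
        m.name
        for m in models
        if SPEED_RANK[m.speed] >= SPEED_RANK[min_speed]
        and ACCURACY_RANK[m.accuracy] >= ACCURACY_RANK[min_accuracy]
    ]


print(shortlist(MODELS, min_speed="Fast", min_accuracy="High"))
```

Relaxing either threshold widens the shortlist, which mirrors the manual balancing act described above.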

Of course, you can download multiple models and choose a specific model at any point.

Speech Note: If you download multiple models, you can switch between them easily while using.

Speech Note: The default model that I use. It’s a good balance between speed and accuracy.

Text to Speech

The same goes for TTS. Here I have gone with the Piper libritts_r models. I like Kathleen’s voice, so that is my default. It’s fast, clear, and quite pleasant to listen to.

Speech Note: Text to Speech models.

Translation

I mostly translate from German to English, since so many OSM-related articles are in German.

Speech Note: Translation Window.

Actions

Speech Note (dsnote) also supports invoking actions from the command line, so you can build OS-level customizations, automations, and shortcuts around it. For example, I have bound Ctrl+Alt+L to start listening:

flatpak run net.mkiol.SpeechNote --action start-listening 

Supported actions

      Invokes an action

      Supported actions (@action_name) are:
       start-listening
       start-listening-translate
       start-listening-active-window
       start-listening-translate-active-window
       start-listening-clipboard
       start-listening-translate-clipboard
       stop-listening
       start-reading
       start-reading-clipboard
       start-reading-text *
       pause-resume-reading
       cancel
       switch-to-next-stt-model
       switch-to-prev-stt-model
       switch-to-next-tts-model
       switch-to-prev-tts-model
       set-stt-model *
       set-tts-model *

      * Optional 'argument' is used to pass model-id or
      text to read. To pass both, set 'argument' to
      "{model-id}text to read".
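
If you want to drive these actions from a script rather than a raw keybinding, a small wrapper helps catch typos in action names before anything is launched. The sketch below is a minimal example, not part of Speech Note: `build_action_command` and `combined_argument` are hypothetical helpers, the model id is made up, and I am assuming the braces in "{model-id}text to read" are literal. The help text above does not show how the optional argument is passed on the command line, so the sketch only formats that string and does not pass it.

```python
import shlex

# Base command taken from the example in this post.
BASE_COMMAND = ["flatpak", "run", "net.mkiol.SpeechNote"]

# Action names copied from the help output above.
SUPPORTED_ACTIONS = {
    "start-listening",
    "start-listening-translate",
    "start-listening-active-window",
    "start-listening-translate-active-window",
    "start-listening-clipboard",
    "start-listening-translate-clipboard",
    "stop-listening",
    "start-reading",
    "start-reading-clipboard",
    "start-reading-text",
    "pause-resume-reading",
    "cancel",
    "switch-to-next-stt-model",
    "switch-to-prev-stt-model",
    "switch-to-next-tts-model",
    "switch-to-prev-tts-model",
    "set-stt-model",
    "set-tts-model",
}


def build_action_command(action):
    """Return the argv list that invokes a Speech Note action."""
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return BASE_COMMAND + ["--action", action]


def combined_argument(model_id, text):
    """Format the optional argument as "{model-id}text to read".

    Assumption: the braces around the model id are literal characters.
    """
    return "{" + model_id + "}" + text


# Print the command instead of running it, so this is safe to try anywhere.
print(shlex.join(build_action_command("start-listening")))

# Hypothetical model id, for illustration only.
print(combined_argument("en_example_model", "Hello from Speech Note"))
```

To actually trigger the action, pass the list returned by `build_action_command` to `subprocess.run`.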

Settings

Speech Note: Enable HW acceleration if available.
Speech Note: Rules that you can use to fix regular errors in STT.

Need help

I’m still looking for good FOSS options for Kannada TTS, Indian English STT voices, and English↔Kannada translation models. Send your recommendations my way.



2 Responses

  1. Jeff McNeill says:

    Thanks for this info. I’ve been meaning to get back into Natural Language Processing stuff. My earlier days was dealing with DeepSpeech and Tesseract, and also Mycroft. It’s gonna take a little while for me to follow up all the threads to see which projects are still viable, which have viable forks, and which have largely been replaced. The requirement for local processing for any of these machine learning / translation components is paramount. The energy consumed by AI is unconscionable, not to mention the squandering of trillions.
