README ¶
gosling
Natural sounding text-to-speech in the terminal (and more).
Pre-requisites
This is NOT intended to be a completely-free, pick-up-and-use TTS solution. In fact, it is simply a wrapper around Google's Cloud Text-to-Speech API.
You will need:
- A GCP account with billing enabled.
- Google gives you 1 million characters free every month. That's nearly 10 books a month. It's essentially free for personal use.
- Once you have a GCP account, enable the TTS API and get a service account.
- Export service account credentials in your shell. You will need to do this every time you open a new shell. Add it to your shell configuration or make a script to run
gosling
for convenience.export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account.json"
- Internet connection every time you need some text spoken to you.
- I have only tested this on Linux. Commands for playing audio will be different on other platforms.
Examples
Simple text with default options
Numbers and punctuation with default options
(the multiple exclamations are something that I have seen other TTSs struggle with):
Welcome to gosling!!! It has options such as "Pitch adjustment" in the range -20.0 to 20.0, "Speaking rate/speed" in the range 0.25 to 4.0 and "Volume gain" (in dB) in the range -96.0 to 16.0.
Other languages
Kannada:
Check out the full voice list, use Wavenet
or Neural2
based voices for better quality.
Installation
Pre-built binaries
Go to the latest release, scroll down to "Assets" and download the correct file for your platform. Unzip the file and run the gosling
binary inside:
./gosling
If you have go
installed
go install github.com/Samyak2/gosling@latest
Usage
Text file
gosling input.txt output.mp3
Play the resulting output.mp3
file using your audio player.
Standard input
echo "hello there" | gosling - output.mp3
Play audio directly
If you have the play
command, which is usually a part of the sox
package (sudo dnf install sox
on Fedora):
echo "hello there" | gosling - - | play -t mp3 -
If you have the ffplay
command, which is a part of ffmpeg
:
echo "hello there" | gosling - - | ffplay -nodisp -autoexit -
Options
gosling
has a lot of configuration around language & voice, audio, etc.
See gosling --help
for all options.
Usage: gosling <input-file> <output-file>
Arguments:
<input-file> Text file to read from. Use - for standard input.
<output-file> Audio file to write to. Use - for standard output.
Flags:
-h, --help Show context-sensitive help.
-l, --language-code="en-US" Language code to use for the synthesis. See full list at: https://cloud.google.com/text-to-speech/docs/voices
-v, --voice-name="en-US-Wavenet-A" Voice name to use for the synthesis. Use an empty string to let the GCP API choose. See full list at: https://cloud.google.com/text-to-speech/docs/voices
--pitch=-3 Pitch adjustment in the range [-20.0, 20.0]. Use a negative number to decrease the pitch. See:
https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#audioconfig
-r, --speaking-rate=1.0 Speaking rate/speed in the range [0.25, 4.0]. See: https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#audioconfig
--volume-gain=0.0 Volume gain (in dB) in the range [-96.0, 16.0]. See: https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#audioconfig
-s, --[no-]ssml Use if text has SSML. Default is plain text. See: https://cloud.google.com/text-to-speech/docs/basics#speech_synthesis_markup_language_ssml_support
--service-endpoint=STRING GCP Service Endpoint. You'll need to set this if you want a Neural2 voice. See: https://cloud.google.com/text-to-speech/docs/endpoints.
FAQ
The voice sounds too robotic
WaveNet
By default, on the default language, gosling
uses a WaveNet based voice model. If you're using a different language, make sure to switch the voice to a WaveNet based one too. Use --voice-name
for this.
Neural2
If WaveNet is not good enough, try using a Neural2
voice type (search for Neural2
in the voice list if you need other languages):
gosling input.txt output.mp3 --service-endpoint 'https://us-central1-texttospeech.googleapis.com' -v en-US-Neural2-A
TODO: this endpoint is currently timing out for all TTS requests, not sure why.
If Neural2 isn't good enough either, well... you'll have to take this up with Google.
Why am I getting this error google: could not find default credentials
?
Either:
- You did not read the Pre-requisites section.
- You forgot to export the
GOOGLE_APPLICATION_CREDENTIALS
environment variable in your shell. - Something is wrong with your GCP service account. See this page that is also linked from the error.
Why don't --pitch
and --volume-gain
have short versions?
These options can have negative values and the command-line parser I use behaves weirdly with negative numbers and short flags. I have removed the short versions to avoid making it a pitfall.
How do I use this with foliate
?
I use this script:
#!/bin/bash
# requires gosling and sox
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account.json"
gosling - - | play -t mp3 - &
trap 'kill $!; exit 0' INT
wait
Copy and save this to a file and chmod +x /path/to/foliate-gosling.sh
it.
TODO: this only works with English text. I need to figure out a way to convert FOLIATE_TTS_LANG_LOWER
to Google's format.
But why?
When I'm too lazy to read an article, I use Google Assistant's "read me this article" feature on my phone. It's extremely good, especially with text-only articles. I could not find an alternative on desktop (specifically, Linux).
Yes, there are quite a few text-to-speech apps on Linux. Most of them either sound like R2D2 or something from the depths of the void. The only one, that I found, which sounds bearable uses an undocumented Google Translate API (probably a ToS violation?). There are also some pre-trained neural-network based models, but they sound like a person speaking through a very low-bandwidth voice call and they skip over numbers and abbreviations pretending they never existed.
The only text-to-speech that sounded good was Google's. So I thought - "they must have a GCP API for this". And they did. And I hacked this together.
TODO
-
speech-dispatcher
support. This will allow using it in Firefox's reader mode, for example. - Some pre-processing of raw text - remove extra/unnecessary punctuation, better formatting for numbers, etc.
License
MIT