Abstract

Recent advances in using text descriptions as prompts to guide speech synthesis have garnered significant attention. Text-Prompt-Based Text-To-Speech (TTS) systems require users to provide style descriptions that are as fine-grained as possible to achieve optimal control over speech synthesis. However, because users differ in how well they can express a style, inadequate style descriptions may degrade the quality of the synthesized speech. Compared to fine-grained style descriptions, allowing users to provide a coarse-grained style description significantly lowers the usage barrier and enhances the overall user experience. In this paper, we propose Auto-Enrichment Text-To-Speech (AE-TTS), a two-stage pipeline that automatically converts user-provided coarse-grained style descriptions into enriched fine-grained ones. Specifically, the first stage infers the contextual information of the input text based on the coarse-grained style description, automatically extracting implicit emotional tendencies, scene characteristics, and thematic features. In the second stage, guided by the coarse-grained style description and the inferred contextual information, the system generates fine-grained style descriptions that are more detailed and expressive. Finally, we adopt an autoregressive generation paradigm to guide speech synthesis, using the given context and the refined style descriptions as prompts. The entire process is powered by large language models (LLMs). Both subjective and objective evaluations demonstrate the effectiveness of our approach in controllable expressive speech generation tasks.

Methodology

Overview of the AE-TTS architecture. Left: A two-stage LLM-driven pipeline converts coarse-grained style descriptions into fine-grained ones via context extrapolation and prompt enrichment. Right: The enriched descriptions and contextual cues guide a controllable TTS model for expressive speech synthesis.

The figure below shows the AE-TTS architecture.



Automatic Prompt Enrichment

Our method enables the LLM to infer implicit information and intent from the context of the input text and to adjust the style and content of the synthesized speech accordingly. Specifically, we first use the LLM to generate the Extrapolated Context: surrounding sentences that cover the topic and scenario of the Input Text. A Coarse-grained Style Description then serves as initial guidance; combined with the Extrapolated Context, it is used to generate a Fine-grained Style Description with greater nuance and expressiveness. Finally, we pass the Extrapolated Context and the Input Text through the Context Compression module to obtain the Paragraph-level Semantic Prompt, which is combined with the Fine-grained Style Description to guide the TTS model toward expressive speech that meets the user's expectations.
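The data flow described above can be sketched as follows. This is only an illustration under stated assumptions: the function `enrich`, the `EnrichedPrompt` container, the prompt wording, and the `llm` and `compress` callables are all hypothetical stand-ins, not the actual prompts or modules used in AE-TTS.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: each callable maps one string to another string.
TextFn = Callable[[str], str]


@dataclass
class EnrichedPrompt:
    extrapolated_context: str  # surrounding sentences generated in stage 1
    fine_description: str      # fine-grained style description from stage 2
    semantic_prompt: str       # paragraph-level semantic prompt


def enrich(text: str, coarse: str, llm: TextFn, compress: TextFn) -> EnrichedPrompt:
    # Stage 1: extrapolate context from the input text and coarse description.
    # The prompt wording here is an assumption made for illustration.
    context = llm(
        "Write sentences that could surround the input text, covering its "
        f"topic and scenario.\nInput text: {text}\nStyle: {coarse}"
    ).strip()

    # Stage 2: refine the coarse description using the extrapolated context.
    fine = llm(
        "Rewrite the style description with more detail and expressiveness.\n"
        f"Input text: {text}\nStyle: {coarse}\nContext: {context}"
    ).strip()

    # Context compression: reduce context plus input text to a single prompt.
    semantic = compress(f"{context}\n{text}")
    return EnrichedPrompt(context, fine, semantic)
```

The fine-grained description and the semantic prompt returned here would then be passed together to the TTS model; that step is omitted because the page does not specify the model's interface.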

Text: I think you’re really talented. (我觉得你真的很有才华。)
Previous Text: In the just-concluded project report, your train of thought was exceptionally clear. || When breaking down complex issues, you also demonstrated a unique perspective. || When encountering difficult problems before, your solutions were always so ingenious. || The part you were responsible for this time was completed quite excellently. || Everyone should learn from your spirit of dedication.
Following Text: I hope you can share more work skills in the future. || I believe you can create more value in future projects. || Our team is really lucky to have a member like you. || I'm looking forward to seeing your wonderful performance next time. || Perhaps you can take the lead in some important tasks later.
Coarse-grained Description: The adult male uses a mezzo and low volume, with a sarcastic tone.
Fine-grained Description: In the context of privately questioning the other person, an adult male expresses his opinion of the other person’s talent in a sarcastic, playful emotional tone in a medium pitch, medium speed, and low volume. Emphasis on the word "really."

Text: What a nice day it is! (今天的天气真好!)
Previous Text: Look at the sunlight, it's warm when it shines on the body. || The gentle breeze blows softly, bringing a trace of coolness. || The sky is blue and there isn't a single cloud. || The birds are singing merrily on the branches. || Such weather is really suitable for going out for a walk.
Following Text: We can find a lawn and sit down to bask in the sun. || Maybe we can see beautiful butterflies later. || I really want to enjoy such nice weather for a while longer. || How about we go for a walk by the lake and feel the shimmering waves of the lake water? || Today is really a good day to relax and unwind.
Coarse-grained Description: Young women express their enjoyment of the beautiful weather with a cheerful high pitched voice and a fast pace.
Fine-grained Description: In an outdoor environment, a young woman expresses her love for the weather with a cheerful, relaxed emotional tone in a high-pitched, fast and loud voice. Emphasize the word "nice" and highlight the pleasant mood.

Text: You seem very busy, am I bothering you? (你似乎很忙,我没打扰你吧?)
Previous Text: I see that you've been working busily in front of the computer. || You've been making phone calls one after another. || I just saw you walking in a hurry with a pile of documents in your arms. || I feel that you have a heavy workload today. || If you're busy, I can wait until you're free to chat.
Following Text: Do you need me to help you with some of the work? || No matter how busy you are at work, you should also pay attention to taking a rest. || I hope you can arrange your time reasonably and don't tire yourself out too much. || If there's anything I can help with, just let me know. || After you're done with this busy period, let's have a good get-together.
Coarse-grained Description: A young female with a gentle high tone and moderate speaking speed.
Fine-grained Description: In the context of caring about the other person's work status, a young female, with a high tone, medium speaking speed and medium volume, and a gentle and considerate emotional tone, expresses her concern. Emphasize the word "bothering" to reflect her consideration.

Text: I wish you could come to the party. (真希望你能来参加派对。)
Previous Text: We have been preparing for this party for a long time, and the programs are extremely wonderful. || Everyone is really looking forward to spending a happy time with you. || The venue is beautifully decorated, and the atmosphere will definitely be great. || We have prepared a lot of delicious food, all of which are your favorites. || Without you, this party would be like losing its backbone.
Following Text: Then we can play games together and laugh heartily. || I believe you will leave beautiful memories if you come. || Don't hesitate anymore. Hurry up and promise me to come to the party. || I can't wait to see you at the party. || Everyone is looking forward to your arrival. You must come!
Coarse-grained Description: A young female expresses with high tone and fast speaking speed.
Fine-grained Description: In the context of sincerely inviting a friend, a young female, with a high tone, fast speaking speed and large volume, and an enthusiastic and expectant emotional tone, extends the invitation. Emphasize the word "wish you" to highlight the expectation.

Text: You can always find a solution! (你总是能找到解决办法!)
Previous Text: You easily resolved such a thorny problem last time. || When everyone else was stuck, only you had a clear train of thought. || Your keen insight into problems is truly admirable. || Every time there's a difficult problem, you're the first person we think of. || Your experience and wisdom always play a crucial role.
Following Text: Keep it up, and you'll definitely be able to solve more difficult problems in the future. || With you in the team, we're never afraid of difficulties. || I hope you can share more of your problem-solving ideas. || I believe you'll create more miracles in the future. || It's definitely right to turn to you when there's a problem.
Coarse-grained Description: An adult male expresses with a trusting medium tone and medium speaking speed.
Fine-grained Description: In the context of affirming someone's ability to handle troubles, an adult male, with a medium tone, speaking speed and volume, and a trusting and respectful emotional tone, affirms the other person. Emphasize the word "always" to strengthen the degree of trust.


Comparison of AE-TTS and Baseline systems

| Model | Text | Style Prompt | Audio |
| --- | --- | --- | --- |
| PromptTTS | Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts. | I want a low pitched female voice. | |
| PromptTTS | Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts. | This madam talks to me with a deep voice. | |
| PromptTTS | Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts. | Decrease the pitch of her voice for me. | |
| Salle | Racism has no place in any sport. | The despair woman's voice resonated slowly, her miserable energy remaining low, pitch high. | |
| Salle | One even gave my little dog a biscuit. | A boy said in a desperate voice. | |
| Salle | A doctor believes this boy to be mad. | Rapidly speaking, the despair man's deep voice resonates with a sense of normal energy. | |
| VoxInstruct | Oh God, save me! (上帝呀,救我吧!) | The voice is tearful, filled with intense sorrow, somewhat warm yet slightly pleading, and somewhat resolute. | |
| VoxInstruct | Good if you understand, child, put away the donation book, and go wash the horse! Your eldest senior brother is coming soon, right? (明白就好,孩子,收好捐册,牵马去洗吧!你大师兄就快来了?) | Old Lu issued instructions in a tone full of deep affection, with a slight earnest plea, calm yet with a certain expectation. He gently said "Understand", and his tone became heavier when mentioning "eldest senior brother". | |
| VoxInstruct | All these make me very happy; this is true happiness. These are my memories, and this is my life! (这些都让我非常高兴,这是真正的幸福。这是我的回忆,又是我的生活!) | Filled with joy and a profound sense of happiness, he expresses himself very warmly and contentedly, without hesitation stating that Lina lives with happy memories. His tone slows slightly when he describes "true". | |
| CosyVoice | His bad jokes, though cliched, kept everyone laughing. (他讲的冷笑话虽然老套,但仍然让大家笑个不停。) | A female speaker with normal pitch and normal speaking rate. | |
| CosyVoice | The beauty of life lies not in the grand moments, but in the simple, everyday wonders that we often overlook. (生活的美不在于宏大的时刻,而在于那些我们经常忽视的简单而日常的奇迹。) | A male speaker with low pitch, fast speaking rate, and angry emotion. | |
| CosyVoice | When we leave this world, people will remember us not for the possessions we accumulated, but for the impact we had on their lives and the love we shared. (当我们离开这个世界时,人们记住的不是我们积累的财物,而是我们对他们生活的影响和我们共享的爱。) | A female speaker with normal pitch, slow speaking rate, and sad emotion. | |
| AE-TTS | I think you’re really talented. (我觉得你真的很有才华。) | In the context of privately questioning the other person, an adult male expresses his opinion of the other person’s talent in a sarcastic, playful emotional tone in a medium pitch, medium speed, and low volume. Emphasis on the word "really." | |
| AE-TTS | What a nice day it is! (今天的天气真好!) | In an outdoor environment, a young woman expresses her love for the weather with a cheerful, relaxed emotional tone in a high-pitched, fast and loud voice. Emphasize the word "nice" and highlight the pleasant mood. | |
| AE-TTS | This dish is really good. (这道菜真不错。) | In the context of praising the chef's craftsmanship in a restaurant, an adult male, with a medium tone, speaking speed and volume, and a happy and satisfied emotional tone, expresses his recognition of the dish. Emphasize the word "really good" to show his satisfaction. | |


Ablation Results in different settings of AE-TTS

We conduct ablation experiments to validate the contribution of each component of AE-TTS. Here, X denotes the Input Text, 𝐷𝑐 and 𝐷𝑓 denote the Coarse-grained and Fine-grained Style Descriptions respectively, 𝐶𝑒𝑥𝑡 denotes the Extrapolated Context, and 𝑇𝑆 denotes the Speech Tokens. The details are as follows:

The original AE-TTS

| Setting | Text | Coarse-grained Description | Audio |
| --- | --- | --- | --- |
| Ab.0: Full AE-TTS pipeline. The complete system with all components enabled, including context extrapolation, style enrichment, and context compression. Serves as the upper-bound reference. | You always know people. (你总是很了解人。) | In the context of praising the other person's high emotional intelligence, a young female, with a high tone, fast speaking speed and large volume, and a praising and admiring emotional tone, expresses her evaluation of the other person. Emphasize the phrase "always" to show the degree of her admiration. | |

Caption Generation

| Setting | Text | Audio |
| --- | --- | --- |
| Ab.1: w/o 𝐷𝑐 when generating 𝐶𝑒𝑥𝑡. Verifies whether 𝐷𝑐 contributes to the emotional consistency and contextual construction of 𝐶𝑒𝑥𝑡. | You always know people. (你总是很了解人。) | |
| Ab.2: w/o X when generating 𝐶𝑒𝑥𝑡. Tests whether a reasonable 𝐶𝑒𝑥𝑡 can be generated without the actual X. | You always know people. (你总是很了解人。) | |
| Ab.3: w/o 𝐷𝑐 when generating 𝐷𝑓. Verifies that 𝐷𝑐 provides the basic user intent when generating 𝐷𝑓, preventing expressive bias. | You always know people. (你总是很了解人。) | |
| Ab.4: w/o 𝐶𝑒𝑥𝑡 when generating 𝐷𝑓. Verifies whether 𝐶𝑒𝑥𝑡 provides scenario support for 𝐷𝑓 that enhances naturalness and contextual consistency. | You always know people. (你总是很了解人。) | |
| Ab.5: w/o X and 𝐶𝑒𝑥𝑡 when generating 𝐷𝑓. Measures the degradation in expressiveness when 𝐷𝑓 is generated solely from 𝐷𝑐. | You always know people. (你总是很了解人。) | |

Speech Synthesis

| Setting | Text | Audio |
| --- | --- | --- |
| Ab.6: 𝐷𝑓 replaced by 𝐷𝑐 when generating 𝑇𝑆. Directly compares the performance of using 𝐷𝑐 versus 𝐷𝑓 during speech synthesis. | You always know people. (你总是很了解人。) | |
| Ab.7: only X and 𝐷𝑐 when generating 𝑇𝑆. Assesses the performance of the system in the absence of both 𝐶𝑒𝑥𝑡 and 𝐷𝑓. | You always know people. (你总是很了解人。) | |
| Ab.8: w/o 𝐶𝑒𝑥𝑡 when generating 𝑇𝑆. Tests whether removing 𝐶𝑒𝑥𝑡 at the generation stage affects the expressiveness and fidelity of the synthesized speech. | You always know people. (你总是很了解人。) | |
| Ab.9: w/o Context Compression. Evaluates the effectiveness of the Context Compression module in summarizing 𝐶𝑒𝑥𝑡. | You always know people. (你总是很了解人。) | |
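The ablation settings Ab.0 to Ab.9 can be read as masks over the inputs each stage receives. The sketch below encodes that reading as a small configuration table; the names `Setting`, `ABLATIONS`, and `dropped` are hypothetical, and the ASCII labels "Dc", "Df", and "Cext" stand in for 𝐷𝑐, 𝐷𝑓, and 𝐶𝑒𝑥𝑡. The input sets assumed for the full pipeline are inferred from the setting descriptions above rather than taken from a published configuration.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Setting:
    # Inputs assumed to be available at each stage of the full pipeline.
    context_inputs: FrozenSet[str] = frozenset({"X", "Dc"})       # -> Cext
    fine_inputs: FrozenSet[str] = frozenset({"X", "Dc", "Cext"})  # -> Df
    tts_inputs: FrozenSet[str] = frozenset({"X", "Cext", "Df"})   # -> TS
    compress: bool = True  # whether Context Compression is applied


FULL = Setting()

ABLATIONS: Dict[str, Setting] = {
    "Ab.0": FULL,
    "Ab.1": replace(FULL, context_inputs=frozenset({"X"})),
    "Ab.2": replace(FULL, context_inputs=frozenset({"Dc"})),
    "Ab.3": replace(FULL, fine_inputs=frozenset({"X", "Cext"})),
    "Ab.4": replace(FULL, fine_inputs=frozenset({"X", "Dc"})),
    "Ab.5": replace(FULL, fine_inputs=frozenset({"Dc"})),
    "Ab.6": replace(FULL, tts_inputs=frozenset({"X", "Cext", "Dc"})),
    "Ab.7": replace(FULL, tts_inputs=frozenset({"X", "Dc"})),
    "Ab.8": replace(FULL, tts_inputs=frozenset({"X", "Df"})),
    "Ab.9": replace(FULL, compress=False),
}


def dropped(name: str) -> Dict[str, Set[str]]:
    """Return, per stage, the inputs a setting removes relative to Ab.0."""
    setting = ABLATIONS[name]
    result: Dict[str, Set[str]] = {}
    for stage in ("context_inputs", "fine_inputs", "tts_inputs"):
        missing = getattr(FULL, stage) - getattr(setting, stage)
        if missing:
            result[stage] = set(missing)
    return result
```

For example, `dropped("Ab.5")` reports that X and the Extrapolated Context are withheld from the fine-grained description stage, matching the setting's description.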


Analysis of Extrapolated Context Length

To evaluate the impact of extrapolated context length on style controllability, we vary the number of surrounding sentences 𝐾 ∈ {1, 2, 3, 4, 5} during context extrapolation.

| Text | Context Length | Fine-grained Description | Audio |
| --- | --- | --- | --- |
| Lang Ping couldn't help crying after the match. (赛后的郎平忍不住哭了。) | K=1 | In an indoor sports venue, an adult female speaks with alto tone, medium loudness, medium speed speech, emotional characteristics of excitement revealed. Emphasizing "crying" to show the overwhelming feelings that burst out at this moment. | |
| | K=2 | | |
| | K=3 | | |
| | K=4 | | |
| | K=5 | | |
| Seven hundred sixty-five thousand seven hundred ninety-six. (七十六万五千七百九十六。) | K=1 | A female speaker reads out the number in a low-pitched, small-volume and rapid manner. Her tone is tinged with regret, as if the number represents something that causes her a sense of loss or disappointment. | |
| | K=2 | | |
| | K=3 | | |
| | K=4 | | |
| | K=5 | | |
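One way to realize a context length of K is to keep only the K sentences nearest the Input Text on each side. The sketch below illustrates this under two assumptions that the page does not confirm: that K counts sentences per side, and that contexts are stored with the "||" separator used in the examples earlier on this page. The helper names `split_sentences` and `truncate_context` are hypothetical.

```python
from typing import List, Tuple

SEP = "||"  # sentence separator used in the context examples on this page


def split_sentences(block: str) -> List[str]:
    """Split a '||'-separated context block into trimmed sentences."""
    return [s.strip() for s in block.split(SEP) if s.strip()]


def truncate_context(previous: str, following: str, k: int) -> Tuple[List[str], List[str]]:
    """Keep the k sentences closest to the input text on each side."""
    if k < 0:
        raise ValueError("k must be non-negative")
    prev = split_sentences(previous)
    foll = split_sentences(following)
    # The last k previous sentences and the first k following sentences
    # are the ones adjacent to the input text.
    nearest_prev = prev[-k:] if k else []
    return nearest_prev, foll[:k]
```

If K exceeds the number of available sentences, the slices simply return everything that is available.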

Fairness Comparison

For a fair side-by-side comparison, all systems are given identical input text and prompts.

| Text | Fine-grained Description | System | Audio |
| --- | --- | --- | --- |
| Boycotting Japanese goods is nothing compared to boycotting stupid people. (抵制日货真不如抵制蠢货。) | In the context of a discussion about boycotting Japanese goods, a young boy angrily expresses his opinion in a loud voice and at a fast pace. He emphasizes the "stupid people" to enhance the emotions of anger and disdain, indicating that he believes it is more important to boycott certain people than to boycott Japanese goods. | PromptTTS | |
| | | Salle | |
| | | VoxInstruct | |
| | | CosyVoice | |
| | | AE-TTS | |
| The most satisfying thing is someone like you who listens so obediently! (最满意的就是你这样听话的人!) | In the scenario of evaluating others' performance, a young man expresses his satisfaction with an obedient person in an admiring way, with a low volume and a fast speed. He emphasizes the phrase "Most satisfied" to highlight his high recognition of the obedient person and shows his eagerness through the fast expression. | PromptTTS | |
| | | Salle | |
| | | VoxInstruct | |
| | | CosyVoice | |
| | | AE-TTS | |
| The usually very polite Old Zhang unexpectedly refused his request. (平时十分客气的老张,竟然拒绝了他的要求。) | In the context of communicating daily events, a young man calmly tells a story in a low, hoarse voice, with a slow and reserved speed. The politeness in his voice lacks tenderness. He emphasizes the word "unexpectedly" to show his surprise at the fact that Lao Zhang, who is usually very polite, refused someone's request. | PromptTTS | |
| | | Salle | |
| | | VoxInstruct | |
| | | CosyVoice | |
| | | AE-TTS | |

In addition, we evaluated AE-TTS on longer audio samples.

| Model | Text | Style Prompt | Audio |
| --- | --- | --- | --- |
| AE-TTS | I still remember how, when we were little, we chased fireflies in the fields while the moonlight shone on the straw stacks; those carefree days are truly gone forever. (还记得小时候,我们在田野里追逐萤火虫,月光洒在稻草堆上,那些无忧无虑的日子,真是一去不复返了。) | In the scenario of recalling past times with others, a middle-aged man tells his childhood experiences in a low, steady tone, at a slow speed and with a moderate volume, with a touch of nostalgia and sentiment. He emphasizes the scenes like "Chasing fireflies in the fields" and "The moonlight shining on the straw stacks" to create a beautiful memory picture and highlight his regret that the carefree childhood days are gone forever. | |
| AE-TTS | Wow! Look at this newly released foldable-screen phone. Not only is the screen huge and super clear, it can also change shape freely. It's just so cool and amazing! (哇!你看这款新发布的折叠屏手机,不仅屏幕超大超清晰,还能自由变形,简直太酷太神奇了!) | When seeing the newly released foldable screen phone, a 25-year-old woman expresses her excitement and surprise in a high pitch, at a fast speed and with a large volume. She emphasizes exclamatory words like "Wow" and "so cool and amazing" to highlight her amazement at the phone's large and clear screen and its ability to transform freely, emphasizing her love and admiration for this phone. | |
| AE-TTS | Dear, stop worrying about those trivial things. No matter what difficulties come up, I will always be by your side, and we will face them together. (亲爱的,别再为那些琐事烦恼了,不管遇到什么困难,我都会一直陪在你身边,我们一起面对。) | In the scenario where his partner is worried about trivial matters, a young man comforts his partner in a gentle and soothing tone, at a moderate speed and with a soft volume. He emphasizes the word "dear" to convey firm companionship and support, expressing affection and comfort. | |
| AE-TTS | I can't believe it. The project we worked so hard on for so long is about to be presented at tomorrow's important meeting. I'm so nervous! (真不敢相信,辛苦准备了这么久的项目,马上就要在明天的重要会议上展示了,好紧张啊!) | On the eve of a project's presentation at an important meeting, an adult woman expresses complex emotions in a slightly high pitch, at a slightly fast speed and with a normal volume. She emphasizes the phrase "so nervous" to show her emphasis and dedication to the project, as well as the mixed feelings of nervousness and anticipation for the presentation. | |
| AE-TTS | How could you do this! We clearly agreed to handle this together, and now you leave it all to me and just walk away. So irresponsible! (你怎么能这样!明明说好了一起处理这个事情,现在却把烂摊子全丢给我,自己拍拍屁股走人,太不负责任了!) | In the scenario of being abandoned by a partner in a cooperative matter, a young man expresses his anger and dissatisfaction in a sharp and fluctuating pitch, at an extremely fast speed and with a volume almost like a roar. He emphasizes the phrases "How could you do this", "leave it all to me" and "so irresponsible" to highlight his strong condemnation of his partner's behavior of breaking the agreement and being irresponsible. The tremor in his voice further reflects the depth of his anger. | |