Introduction
We supported a multilingual project to improve speech-to-text systems by collecting realistic, context-rich SMS messages. The goal was to give machine translation models authentic, diverse data reflecting users’ everyday language. The main challenge was generating coherent, believable content aligned with the given scenarios while maintaining linguistic consistency and quality across multiple languages.
Solution and Benefits
- Creation of realistic SMS messages by native speakers
- Use of client-provided scenarios and contexts to simulate real-life conversations
- Generation of new SMS messages or replies to predefined ones to mimic natural interactions
- Assurance of linguistic and semantic diversity in the generated content
- Enforcement of specific constraints: a minimum of 40 characters and a maximum of 30 words per message
- Collection of authentic data essential for training more accurate and natural speech-to-text systems
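The length constraints listed above could be checked automatically before a message is accepted. The following is a minimal sketch, not the project's actual tooling: the function name `check_message` is hypothetical, and it assumes that characters are counted after trimming surrounding whitespace and that words are whitespace-separated tokens (which would not suit languages written without spaces).

```python
MIN_CHARS = 40  # minimum message length, from the project constraints
MAX_WORDS = 30  # maximum word count, from the project constraints


def check_message(text: str) -> list[str]:
    """Return a list of constraint violations (empty if the message passes).

    Hypothetical helper for illustration only.
    """
    stripped = text.strip()
    problems = []

    # Assumption: length is the number of characters, spaces included.
    if len(stripped) < MIN_CHARS:
        problems.append(
            f"too short: {len(stripped)} characters (minimum {MIN_CHARS})"
        )

    # Assumption: words are tokens separated by whitespace.
    word_count = len(stripped.split())
    if word_count > MAX_WORDS:
        problems.append(
            f"too long: {word_count} words (maximum {MAX_WORDS})"
        )

    return problems
```

A message such as "See you at 6?" would be flagged as too short, while "Running late, save me a seat near the window please!" satisfies both limits and returns an empty list.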
Results and Conclusions
- Project completed in just over two months
- Four languages covered, with significant volumes of data collected
- Concrete contribution to improving voice recognition and machine translation systems using real-world data