Voice recognition has become increasingly important with the emergence of software-defined vehicles. In-vehicle voice recognition systems allow drivers to interact with their cars through spoken commands, letting them control vehicle functions without taking their eyes off the road. Recent advances in deep learning have enabled these systems to process natural language. However, drivers often use short, truncated phrases, making it challenging to understand their intentions and translate them into specific infotainment or driving-related functions. Interpreting the driver's intent from a single utterance alone therefore remains a daunting task: even a deep learning model trained specifically for in-vehicle intent classification struggles when the input sentence is unexpectedly short and lacks context. To address this challenge, we propose a context-enriched intent classifier. Unlike conventional classifiers that rely solely on the driver's utterance, our approach takes into account the wider context and environment surrounding the driver. Specifically, we combine the driver's short, incomplete utterance with contextual information and feed this input into a pre-trained Large Language Model (LLM) such as GPT-3.5. Leveraging the common-sense knowledge the LLM has acquired from vast amounts of general data, the model generates a complete, context-enriched sentence. The conventional intent classifier then takes this generated sentence as input and maps it to the driver-intended in-vehicle function.
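The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt format, the canned LLM expansion, and the keyword-lookup classifier are all assumptions standing in for a real LLM API call and a trained intent model.

```python
def build_prompt(utterance: str, context: dict) -> str:
    """Combine a short driver utterance with contextual signals into one prompt."""
    ctx = "; ".join(f"{k}: {v}" for k, v in context.items())
    return (
        "Rewrite the driver's short utterance as a complete in-vehicle command, "
        f'given the context ({ctx}). Utterance: "{utterance}"'
    )

def expand_with_llm(prompt: str) -> str:
    """Placeholder for a call to a pre-trained LLM such as GPT-3.5.
    A real system would send `prompt` to the model; here we return a
    canned expansion so the sketch is runnable."""
    return "Turn on the seat heater because the cabin is cold."

def classify_intent(sentence: str) -> str:
    """Stand-in for the conventional intent classifier: a trivial keyword
    lookup instead of a trained deep learning model."""
    keywords = {"seat heater": "SEAT_HEATING_ON", "window": "WINDOW_CONTROL"}
    for phrase, intent in keywords.items():
        if phrase in sentence.lower():
            return intent
    return "UNKNOWN"

# Short, truncated utterance plus context -> enriched sentence -> intent.
utterance = "too cold"
context = {"cabin_temperature_c": 8, "season": "winter"}
enriched = expand_with_llm(build_prompt(utterance, context))
print(classify_intent(enriched))  # SEAT_HEATING_ON with the canned expansion
```

The key design point is that the classifier never sees the raw two-word utterance; it only receives the LLM's context-enriched rewrite, which resembles the complete sentences the classifier was trained on.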
Our study demonstrates a significant improvement in intent classification over conventional classifiers that rely on a single utterance, particularly when the driver's utterance is short and ambiguous. We also found that the method can be personalised to individual driver preferences. However, excessive contextual information can itself act as noise, so controlling how much context is supplied to the model remains a limitation of the approach. In conclusion, we propose a context-enriched intent classifier for in-vehicle voice recognition systems. By combining short utterances with contextual information through a pre-trained large language model, the classifier can accurately interpret the driver's intentions. Furthermore, our research shows that this approach offers significant potential for customisation, enabling a more user-friendly interface tailored to individual driver preferences.
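One way to keep excessive context from acting as noise is to cap how many contextual signals reach the model. The sketch below is purely illustrative: the relevance scores and the top-k cutoff are assumptions for demonstration, since the abstract does not specify a selection method.

```python
def select_context(signals: dict, scores: dict, k: int = 2) -> dict:
    """Keep only the k contextual signals with the highest relevance
    scores, so weakly related signals do not drown out the useful ones."""
    top = sorted(signals, key=lambda s: scores.get(s, 0.0), reverse=True)[:k]
    return {s: signals[s] for s in top}

# Hypothetical vehicle signals and hand-assigned relevance scores.
signals = {"cabin_temperature_c": 8, "radio_station": "FM 91.9",
           "season": "winter", "fuel_level_pct": 72}
scores = {"cabin_temperature_c": 0.9, "season": 0.7,
          "radio_station": 0.1, "fuel_level_pct": 0.2}

print(select_context(signals, scores))
# {'cabin_temperature_c': 8, 'season': 'winter'}
```

In a deployed system the scores themselves would have to come from somewhere (e.g. a learned relevance model or per-driver usage statistics), which is where the personalisation mentioned above could plug in.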
Ms. SongEun Lee, Research Engineer, Hyundai Motor Group