An in-depth look at how language barriers limit conversational AI in video security systems, and how multilingual NLP and speech-to-text platforms address the challenge.
Imagine a meeting between a security systems integrator and a client. The integrator suggests a video security solution by company A, saying it has the best natural-language AI on the market. The client can search hours of footage using voice commands, such as “show me all people carrying a black backpack entering the building this morning.” The system understands and shows the requested footage within seconds—but only if you talk to it in English.
The client, however, is from a country where English is not widely spoken, and neither is Chinese, Spanish, French or any other “global” language. For them, the value added by natural-language AI might be very limited.
It is a mismatch, to say the least: the developers of natural-language AI models for security systems come from only a handful of countries—the US, China, Korea—and build conversational AI first and foremost in and for their own languages, while the global security workforce comes from all over the world and speaks many more languages than those developers might have in mind.
No longer lost in translation – conversational AI for niche languages
Lithuanian company Neurotechnology recently introduced an AI platform for natural-language processing tasks, seeking to greatly improve the usability of natural-language AI for speakers of Lithuanian, as well as closely related Latvian, and Estonian. Those three languages—each spoken by fewer than five million people—were the obvious choice for the company, but its goals go far beyond that.
“Our goal is to add more languages,” Vytas Mulevičius, NLP Team Lead at Neurotechnology, told asmag in an exclusive interview. “Smaller European languages are especially interesting for us. There is significant demand, but they are not in the focus of the big players in the industry. But also beyond Europe, if it’s a niche language that’s nevertheless important, it’s interesting for us.”
The company’s newly introduced, cloud-based AI platform works as a text-to-speech and speech-to-text tool that is available as an API or through a web interface. It seeks to help organizations automate language-related tasks such as transcription and voice synthesis. For security operators, it is most interesting as a tool that can optimize natural-language search of security footage.
As a biometrics company that has won many NIST awards for its facial recognition technology, Neurotechnology is focused on the highest possible accuracy, and the existing offerings for niche languages were simply not good enough—meaning not on par with text-to-speech models for English, Japanese or Korean.
How does natural-language search work?
To understand how Neurotechnology’s new tool works, we need to zoom in on how large language models (LLMs) integrated into video security systems retrieve footage. The focus here is primarily on LLMs developed by security camera manufacturers and VMS providers.
Going back to our initial example, let’s focus on this search query: “Show me all people carrying a black backpack entering the building this morning.”
In the first step, commonly called Natural Language Understanding, or NLU, the LLM identifies key words and attributes, essentially translating the raw input into a machine-understandable prompt. “This morning” becomes 6am to 12pm, and “entering the building” narrows the focus down to footage captured by the cameras near entrances.
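To make this concrete, the NLU step can be sketched as turning the free-text query into a structured filter. The function name, field names and the fixed 6 a.m. to 12 p.m. window below are illustrative assumptions, not any vendor's actual schema:

```python
from datetime import datetime, time

def parse_query(asked_at: datetime) -> dict:
    """Toy NLU step: map the example query to a structured filter.

    Hypothetical sketch -- real systems derive this from the LLM's
    interpretation of the query, not from hard-coded rules.
    """
    # "this morning" -> a concrete window on the day the question is asked
    start = datetime.combine(asked_at.date(), time(6, 0))
    end = datetime.combine(asked_at.date(), time(12, 0))
    return {
        # "people carrying a black backpack"
        "objects": [{"type": "person",
                     "carrying": {"type": "backpack", "color": "black"}}],
        # "entering the building" -> only cameras covering entrances
        "camera_group": "entrances",
        "time_range": (start, end),
    }
```

A real system would additionally resolve “this morning” against the site’s time zone and map “entrances” to concrete camera IDs; the point is that every fuzzy phrase becomes a machine-checkable constraint.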
Meanwhile, “black backpack” is matched with footage by means of multimodal bridging. This doesn’t mean the AI is quickly “rewatching” six hours of footage, but that it matches “black backpack” with metadata created by the system as the footage was captured, bringing up all instances of when it identified a black backpack. Instead of using static tags (“black” and “backpack”), modern AI systems use vector indexes and similarity thresholds, enabling them to recognize that a “dark-blue backpack” or a “black messenger bag” might be meant as well, while a “red suitcase” is almost certainly not what the user is searching for.
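A minimal sketch of that vector-matching idea, with tiny hand-made embeddings standing in for the learned multimodal embeddings (typically hundreds of dimensions) a real system would use; the 0.8 threshold is likewise an illustrative assumption:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hand-made toy embeddings: similar objects get similar vectors.
index = {
    "dark-blue backpack":  [0.8, 0.9, 0.2],
    "black messenger bag": [0.9, 0.5, 0.3],
    "red suitcase":        [0.1, 0.2, 0.9],
}

query = [0.9, 0.8, 0.1]  # stand-in embedding of "black backpack"
THRESHOLD = 0.8          # similarity cut-off, chosen for illustration

matches = [label for label, vec in index.items()
           if cosine(query, vec) >= THRESHOLD]
```

With these toy vectors, both backpack-like items clear the threshold while the red suitcase falls well below it, which is exactly the behavior static tags cannot provide.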
However, if the initial prompt is given in Lithuanian (or perhaps Hindi or Arabic), the sequence might stall even before the NLU stage. Before “understanding” can begin, the system has to identify phonemes, assemble them into words, and assemble words into sentences. Here, too, modern AI uses its knowledge of human speech and the probability of phoneme and word sequences to produce an accurate transcription.
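The role of those probabilities can be illustrated with a toy rescoring step: the recognizer proposes acoustically similar candidate transcriptions, and a language-model score favors the word sequence that is actually plausible. All the scores below are invented for illustration:

```python
# Candidate transcriptions with acoustic log-scores (invented numbers).
candidates = [
    ("juoda kuprinė", -2.1),   # "black backpack" -- valid Lithuanian
    ("juoda kup rinė", -2.0),  # sounds almost identical, but not real words
]

# Invented language-model log-scores: real words in a plausible order
# score far higher than a nonsense split.
lm_score = {"juoda kuprinė": -1.0, "juoda kup rinė": -6.0}

def combined(candidate):
    """Total score = acoustic evidence + language-model plausibility."""
    text, acoustic = candidate
    return acoustic + lm_score[text]

best, _ = max(candidates, key=combined)
```

Without enough Lithuanian in its training data, a model’s language scores are unreliable, and the acoustically marginal but nonsensical candidate can win.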
Needless to say, “juoda kuprinė,” spoken by Lithuanian security staff, has to be transcribed before it can become “black backpack” so the system can find the footage. Many mass-market LLMs, however, already fail to do the initial task accurately.
Multi-language training for AI
Neurotechnology identified this pain point: LLMs have trouble “matching” the spoken and the written word in lesser-spoken languages. This is simply because training data—that is, real, high-quality voice recordings—is lacking for these languages in training datasets.
“About 60% of training data of even the best open-source models are English,” Mulevičius said. “All other languages combined make up the remaining 40%. And languages like Lithuanian are usually below 0.01% each.”
This leads to a significant gap in quality, which is most obvious in text-to-speech applications like voice synthesis.
“When you talk to ChatGPT or Google Gemini in English, it’s almost impossible to tell it’s a robot replying. This shows that the amount and quality of training data play a large role,” Mulevičius said. “When chatbots speak to you in Lithuanian, however, you can immediately tell something’s off. They speak, for example, in a thick Russian or English accent.”
The same issue persists on the “machine understanding” side. The risk of a misunderstanding is simply greater if the operator speaks a lesser-spoken language, especially if their accent is not closely matched in the training data.
The step from spoken word to written word in the same language “is indeed the main pain point of existing systems,” Mulevičius said. “Once that task is done accurately, however, translating from one language to the other and generating legitimate text in languages like Lithuanian via AI is not the problem.”
The reason for developing a platform for three languages simultaneously—Lithuanian, Latvian and Estonian—was not just about adding another market for Neurotechnology’s tool.
“Adding another language actually improves AI performance in the original language as well,” Mulevičius said. “Existing research suggested that adding Latvian and Estonian on top of a model developed primarily for Lithuanian improves quality, and our solution shows that it really does.”
For an AI that is meant to understand, individual languages cannot be treated as siloed entities, especially lesser-spoken ones.
“Lithuanian, for example, has changed relatively little at its core over the past centuries,” Mulevičius said. “At the same time, its speakers—and thereby the language itself—integrate a lot of vocabulary from other languages, such as English. Take, for example, a Lithuanian programmer who speaks about ‘pushing a code for a new algorithm’ in Lithuanian. He will still use the words ‘push’ and ‘code’ in English. Sometimes English vocabulary is changed slightly to adapt to the rules of Lithuanian grammar. A natural-language AI model must be able to process all this. And most importantly, it must be actively taught how to understand.”
Contrary to popular belief, AI cannot improve and expand autonomously at the highest level of sophistication. This is especially true when it comes to adding new languages and integrating new developments in usage.
“You always have to come back and analyze the errors,” Mulevičius said, adding that error rates might spike if AI training is based only on synthetic data, gradually moving away from the real-life scenarios the AI is meant to work in.
Conclusion – an ongoing challenge
For integrators of security systems, this poses a challenge. Making natural-language search functions work sometimes requires add-ons and patches, such as SDKs and frequent updates. Even if users operate in languages the systems natively support, providing future-ready solutions is not just a question of one-time installation—it is an ongoing task of continuous refinement.
Product Adopted: Software