Google has recently introduced a new multilingual text vectorizer called RETVec (an acronym for Resilient and Efficient Text Vectorizer), to aid identification of potentially malicious content like spam and fraudulent emails in Gmail.
While massive platforms like YouTube and Gmail use text classification models to identify frauds, offensive remarks, and phishing attempts, threat actors are known to create counter-strategies to get around these security mechanisms.
The project description on GitHub reads, “RETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more.”
“The RETVec model is trained on top of a novel character encoder which can encode all UTF-8 characters and words efficiently.”
The Google-sponsored platforms reveal that they have been using Adversarial text manipulations, such as the usage of homoglyphs, keyword stuffing, and invisible characters.
With its out-of-the-box support for over 100 languages, RETVec seeks to contribute to developing more robust and computationally affordable server-side and on-device text classifiers that are more durable and effective.
In natural language processing (NLP), vectorization is a technique that maps words or phrases from a lexicon to a matching numerical representation for use in sentiment analysis, text classification, and named entity recognition, among other analyses.
This article has been indexed from CySecurity News – Latest Information Security and Hacking Incidents