Google today introduced TensorFlow.Text, a library for preprocessing text for language models built with TensorFlow. The open source machine learning framework, created by the Google Brain team, has seen more than 41 million downloads.
TensorFlow.Text can be installed using pip and provides tokenizers that break text apart into units such as words, numbers, and punctuation for analysis.
At launch, TensorFlow.Text can tokenize on white space, on Unicode script boundaries, and on predetermined sequences of word fragments like suffixes or prefixes that Google calls wordpieces. Wordpieces are commonly used in approaches like BERT, a pretraining technique for language models Google open-sourced last fall.
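To make the two schemes concrete, here is a minimal sketch in plain Python of white-space tokenization versus a greedy wordpiece split. This is a conceptual illustration, not the actual TensorFlow.Text API, and the tiny vocabulary is a hypothetical example:

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace, the simplest tokenization scheme."""
    return text.split()

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first wordpiece split, as popularized by BERT.
    Pieces that continue a word are conventionally prefixed with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary piece matched; word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical wordpiece vocabulary for illustration only
vocab = {"token", "##ization", "un", "##related"}
print(whitespace_tokenize("tokenization is unrelated"))
# ['tokenization', 'is', 'unrelated']
print(wordpiece_tokenize("tokenization", vocab))
# ['token', '##ization']
```

The wordpiece approach lets a model cover rare words with a small vocabulary, since an unseen word like "tokenization" decomposes into fragments it has seen before.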
The library also comes with ops for normalization, n-grams, and sequence constraints for labeling, according to a Medium post announcing the news.
TensorFlow.Text’s tokenizers return RaggedTensors, a new kind of tensor designed to represent variable-length data such as text. RaggedTensors and Unicode support for TensorFlow were first detailed by Google engineer Mark Omernick earlier this year at the TensorFlow Dev Summit.
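The motivation for a ragged type is that tokenized sentences have different lengths, so they don't fit a rectangular tensor without padding. The plain-Python sketch below mimics the values-plus-row-splits encoding that ragged tensors use; it is an illustration of the idea, not TensorFlow code:

```python
sentences = ["the quick brown fox", "hello world", "one"]
rows = [s.split() for s in sentences]  # rows of different lengths

# Flatten into one values list plus row_splits offsets marking row boundaries
values, row_splits = [], [0]
for row in rows:
    values.extend(row)
    row_splits.append(len(values))

print(values)      # ['the', 'quick', 'brown', 'fox', 'hello', 'world', 'one']
print(row_splits)  # [0, 4, 6, 7]

# Row i is recovered as values[row_splits[i]:row_splits[i + 1]]
assert values[row_splits[1]:row_splits[2]] == ["hello", "world"]
```

This encoding stores every token exactly once, so no padding tokens are wasted on short sentences and row boundaries are recovered in constant time.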
The news comes just days after the beta release of TensorFlow 2.0. The latest version of Google’s open source framework was released in alpha in March at the TensorFlow Dev Summit. TensorFlow 2.0 pares down its APIs, integrates more deeply with Keras, and brings runtime improvements for eager execution.
TensorFlow.Text is the latest dedicated library introduced by Google in the past few months to help people accomplish specific tasks with machine learning. TensorFlow Graphics was released last month and is designed to bring more deep learning to graphics and 3D models.
Perhaps the most popular is TensorFlow Lite for embedded devices, which is now used on more than 2 billion devices, Google said earlier this year. Google uses TensorFlow Lite to power things like speech detection in Gboard and edge detection in Google Photos.