fastBERTtokens: Tokenizing for BERT in parallel

Version 1.0.0 (1.4 KB) by Ralf Elsas
This function divides your text into batches and tokenizes the batches in parallel, which provides a significant speed-up.
18 downloads
Updated 24 Feb 2023

Function to use the MATLAB BERT tokenizer in parallel
This function divides your text into batches and tokenizes the batches in parallel. Because the MATLAB tokenizer is very slow on large data when run on a single processor, this provides a significant speed-up. On an i7-10875H laptop with 8 logical units, tokenizing 76k sentences takes about 100 seconds.
Note that you must pass in the MATLAB BERT model, because different BERT models use different encodings for special BERT tokens such as [SEP].
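
The batching idea described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped with fastBERTtokens: the function name parallelTokenizeSketch and its arguments are hypothetical. It assumes that mdl is a BERT model struct from the Transformer Models repository, that mdl.Tokenizer provides an encode method returning one cell of token codes per input string, and that the tokenizer object can be sent to parallel workers. Taking the tokenizer from the supplied model is what keeps the special-token encodings consistent with that model.

```matlab
function tokens = parallelTokenizeSketch(textData, mdl, numBatches)
% Hypothetical sketch of batch-parallel BERT tokenization.
%   textData   : string array of sentences
%   mdl        : BERT model struct (assumed to contain mdl.Tokenizer)
%   numBatches : number of batches, e.g. the number of parallel workers

textData   = textData(:);                 % work on a column vector
n          = numel(textData);
numBatches = min(numBatches, n);          % never more batches than sentences

% Split the sentences into contiguous, nearly equal-sized batches.
edges   = round(linspace(0, n, numBatches + 1));
batches = cell(numBatches, 1);
for k = 1:numBatches
    batches{k} = textData(edges(k)+1:edges(k+1));
end

% Use the tokenizer of the supplied model so that special tokens such as
% [SEP] receive the codes this particular model expects.
tokenizer = mdl.Tokenizer;

results = cell(numBatches, 1);
parfor k = 1:numBatches                   % runs serially without Parallel Computing Toolbox
    % Assumption: encode returns a cell array with one entry per sentence.
    results{k} = reshape(encode(tokenizer, batches{k}), [], 1);
end

% Concatenate batch results; batches are contiguous, so order is preserved.
tokens = vertcat(results{:});
end
```

Under these assumptions, a call such as tokens = parallelTokenizeSketch(sentences, mdl, 8) would return one cell of token codes per sentence, in the original order. Using one batch per worker keeps scheduling overhead low; smaller batches may balance the load better when sentence lengths vary widely.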

Cite As

Ralf Elsas (2024). fastBERTtokens: Tokenizing for BERT in parallel (https://www.mathworks.com/matlabcentral/fileexchange/125295-fastberttokens-tokenizing-for-bert-in-parallel), MATLAB Central File Exchange. Retrieved .

MATLAB Release Compatibility
Created with R2022b
Compatible with R2021a and later releases
Platform Compatibility
Windows macOS Linux
Acknowledgements

Inspired by: Transformer Models

Version Published Release Notes
1.0.0