BERT encoding is very slow - Help

Question

Zzz 2021-5-7

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/823815-bert-encoding-is-very-slow-help

Answered: Ralf Elsas 2023-2-26

I've been following this GitHub repository: https://github.com/matlab-deep-learning/transformer-models, which is the MATLAB implementation of BERT.

While trying to encode my text with the tokenizer, following this script, I realized that BERT encoding takes a very long time on my dataset.

My dataset contains 1000+ text entries, each ~1000 in length. I noticed that the example CSV used in the GitHub repository contains only very short description text. My question is: how can we perform text preprocessing using BERT encoding, and how can we speed up the encoding process?

Thanks!


Answer 1

Divya Gaddipati 2021-5-13

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/823815-bert-encoding-is-very-slow-help#answer_698938

Here are a few things that you can try to speed up the tokenizer, which were suggested by the GitHub repo author (you can also find this information here):

1. Remove redundant white-space tokenization in BasicTokenizer

2. Convert basic tokenized tokens to UTF32 in one call in FullTokenizer, and modify WordPieceTokenizer to accept UTF32 as input.

3. Only call sub.string() once in WordPieceTokenizer.

4. Remove input validation in WhitespaceTokenizer which may be called many times.
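To make the idea behind these suggestions concrete, here is a minimal, language-agnostic sketch in Python. It is not the transformer-models code: the function names `wordpiece` and `encode` and the tiny vocabulary are made up for illustration. It only shows the general pattern the list points at: split each text on white space once, validate the input once at the outer entry point rather than inside the per-word helper, and keep repeated conversions out of the inner loop.

```python
# Illustrative sketch only: NOT the transformer-models implementation.
# The vocabulary and the function names are hypothetical.
VOCAB = {"[UNK]", "token", "##ize", "##r", "##s", "fast", "##er", "bert"}


def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word.

    Deliberately no input validation here: this helper runs once per
    word, so any checks belong in the caller.
    """
    pieces = []
    start = 0
    n = len(word)  # computed once, not on every loop iteration
    while start < n:
        end = n
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(texts):
    """Tokenize a list of strings, validating and splitting each one once."""
    if not all(isinstance(t, str) for t in texts):  # validate once, up front
        raise TypeError("texts must be a list of strings")
    encoded = []
    for text in texts:
        words = text.lower().split()  # single white-space pass per text
        tokens = []
        for w in words:
            tokens.extend(wordpiece(w))
        encoded.append(tokens)
    return encoded
```

With 1000+ long entries the per-word helper is called a very large number of times, which is why moving validation and conversions out of it tends to matter more than tuning the outer loop.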

If the issue still exists, you could also create a new issue on the GitHub page itself.


Answer 2

Ralf Elsas 2023-2-26

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/823815-bert-encoding-is-very-slow-help#answer_1180280

Hello! For everybody dealing with this issue - it can be easily solved: fastBERTtokens
