erasePunctuation

Erase punctuation from text and documents

Syntax

newStr = erasePunctuation(str)

newDocuments = erasePunctuation(documents)

newDocuments = erasePunctuation(documents,'TokenTypes',types)

Description

newStr = erasePunctuation(str) erases punctuation and symbols from the elements of str. The function removes characters that belong to the Unicode punctuation or symbol classes.

newDocuments = erasePunctuation(documents) erases punctuation and symbols from the tokenized documents in documents. If a word is empty after the function removes punctuation and symbol characters, then the function removes the word. By default, the function erases punctuation only from tokens with type 'punctuation' or 'other'. For example, the function does not erase punctuation and symbol characters from URLs and email addresses.

newDocuments = erasePunctuation(documents,'TokenTypes',types) erases punctuation and symbols from only the specified token types.

Examples

Erase Punctuation from Text

Erase the punctuation from the text in str.

str = "it's one and/or two.";
newStr = erasePunctuation(str)

newStr = 
"its one andor two"

To insert a space where the "/" symbol is, first use the replace function.

newStr = replace(str,"/"," ")

newStr = 
"it's one and or two."

newStr = erasePunctuation(newStr)

newStr = 
"its one and or two"

Erase Punctuation from Documents

Erase the punctuation from an array of documents.

documents = tokenizedDocument([ ...
    "An example of a short sentence." 
    "Another example... with a URL: https://www.mathworks.com"])

documents = 
  2×1 tokenizedDocument:

     7 tokens: An example of a short sentence .
    10 tokens: Another example . . . with a URL : https://www.mathworks.com

newDocuments = erasePunctuation(documents)

newDocuments = 
  2×1 tokenizedDocument:

    6 tokens: An example of a short sentence
    6 tokens: Another example with a URL https://www.mathworks.com

Here, the function does not erase the punctuation symbols from the URL.
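The 'TokenTypes' syntax can be illustrated with a short sketch. It assumes that you also want punctuation removed from web addresses, so it lists 'web-address' alongside the default types. The variable names are illustrative, and the output is not shown here.

```matlab
% Sketch only: requires Text Analytics Toolbox.
documents = tokenizedDocument( ...
    "Another example... with a URL: https://www.mathworks.com");

% Default token types plus 'web-address', so that the URL token is
% processed as well.
types = ["punctuation" "other" "web-address"];

newDocuments = erasePunctuation(documents,'TokenTypes',types)
```

Compare the result with the output of the default syntax above to see the effect on the URL token.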

Input Arguments

`str` — Input text
string array | character vector | cell array of character vectors

Input text, specified as a string array, character vector, or cell array of character vectors.

Example: ["An example of a short sentence."; "A second short sentence."]

Data Types: string | char | cell

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

`types` — Token types to erase punctuation from
`{'punctuation','other'}` (default) | string array | character vector | cell array of character vectors

Token types to erase punctuation from, specified as a character vector, string array, or a cell array of character vectors containing one or more token types (including custom token types).

The tokenizedDocument and addTypeDetails functions automatically detect the following token types:

letters — string of letter characters only
digits — string of digits only
punctuation — string of punctuation and symbol characters only
email-address — detected email address
web-address — detected web address
hashtag — detected hashtag (starts with "#" character followed by a letter)
at-mention — detected at-mention (starts with "@" character, followed by 1 to 15 ASCII letter, digit, or underscore characters)
emoticon — detected emoticon
emoji — detected emoji
other — does not belong to the previous types and is not a custom type

To specify your own custom token types when tokenizing, use the 'CustomTokens' or 'RegularExpressions' options in tokenizedDocument. If you do not specify a type for a custom token, then the software sets the corresponding token type to 'custom'.
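As a minimal sketch of how custom token types interact with erasePunctuation, the following code tokenizes text with a custom token and then names the 'custom' type explicitly. The sample text and the "C++" token are hypothetical. The output is not shown because it depends on how the text is tokenized.

```matlab
% Hypothetical sample text. "C++" is registered as a custom token, so the
% tokenizer keeps it as one token. No type is given, so its type is 'custom'.
str = "We compared C++ and MATLAB.";
documents = tokenizedDocument(str,'CustomTokens',"C++");

% Inspect the token types that were assigned.
tdetails = tokenDetails(documents);
disp(unique(tdetails.Type))

% Default call: only tokens of type 'punctuation' or 'other' are processed.
defaultDocs = erasePunctuation(documents);

% Additionally process tokens of type 'custom'.
types = ["punctuation" "other" "custom"];
customDocs = erasePunctuation(documents,'TokenTypes',types);
```

Leaving 'custom' out of types is one way to protect tokens such as "C++" from being altered.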

Data Types: string | char | cell

Output Arguments

`newStr` — Output text
string array | character vector | cell array of character vectors

Output text, returned as a string array, character vector, or cell array of character vectors. str and newStr have the same data type.
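The following sketch checks the statement that the output has the same data type as the input. The sample sentences are taken from the str example above; the variable names are illustrative.

```matlab
% The same text in each of the three supported input types.
s  = ["An example of a short sentence."; "A second short sentence."];
c  = {'An example of a short sentence.','A second short sentence.'};
ch = 'An example of a short sentence.';

newS  = erasePunctuation(s);
newC  = erasePunctuation(c);
newCh = erasePunctuation(ch);

% Each output class should match the class of the corresponding input.
disp(class(newS))
disp(class(newC))
disp(class(newCh))
```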

`newDocuments` — Output documents
`tokenizedDocument` array

Output documents, returned as a tokenizedDocument array.

More About

Unicode Character Categories

Each Unicode character is assigned a category. The following table summarizes the Unicode punctuation and symbol categories and provides an example character from each category:

Category	Category Code	Number of Characters	Example Character
Punctuation, Connector	[Pc]	10	_
Punctuation, Dash	[Pd]	24	-
Punctuation, Close	[Pe]	73	)
Punctuation, Final quote	[Pf]	10	”
Punctuation, Initial quote	[Pi]	12	“
Punctuation, Other	[Po]	566	!
Punctuation, Open	[Ps]	75	(
Symbol, Currency	[Sc]	54	$
Symbol, Modifier	[Sk]	121	^
Symbol, Math	[Sm]	948	+
Symbol, Other	[So]	5855	¦

For more information, see [1].
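To see that symbol characters are erased as well as punctuation marks, consider this sketch. The sample string is hypothetical and mixes characters from several of the categories in the table.

```matlab
% Hypothetical text with punctuation (: . ( ) _) and symbols ($ + ^).
str = "Total: $5 + tax (approx.) ^_^";
newStr = erasePunctuation(str);

% None of these characters should remain in the result.
erased = [":" "$" "+" "(" ")" "." "^" "_"];
tf = contains(newStr,erased)
```

Here, tf is expected to be false (logical 0) because every listed character belongs to a punctuation or symbol category.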

Tips

For string input, erasePunctuation removes punctuation characters from URLs and HTML tags. This behavior can prevent the functions eraseTags, eraseURLs, and decodeHTMLEntities from working as expected. If you want to use these functions to preprocess your text, then use these functions before using erasePunctuation.
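A minimal sketch of this ordering, assuming hypothetical sample text that contains both an HTML tag and a URL:

```matlab
% Hypothetical sample text.
str = "See https://www.mathworks.com for details, or read the <b>docs</b>!";

% Remove tags and URLs first, while their punctuation is still intact.
str = eraseTags(str);
str = eraseURLs(str);

% Then erase the remaining punctuation.
newStr = erasePunctuation(str)
```

If you call erasePunctuation first, the angle brackets and the "://" in the text are removed, so eraseTags and eraseURLs no longer have anything to match.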

References

[1] Unicode Character Categories. https://www.fileformat.info/info/unicode/category/index.htm

Version History

Introduced in R2017b

R2018b: `erasePunctuation` skips complex tokens

Starting in R2018b, for tokenizedDocument input, erasePunctuation, by default, erases punctuation and symbol characters only from tokens with type 'punctuation' or 'other'. This behavior prevents the function from affecting complex tokens such as URLs and email addresses.

In previous versions, erasePunctuation erased punctuation characters from all tokens. To reproduce the previous behavior, use the 'TokenTypes' name-value pair to specify the token types to process.
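One possible way to approximate the earlier behavior is to pass every token type that occurs in the documents. This is a sketch and an assumption about usage, not an officially documented recipe. It uses tokenDetails to collect the types.

```matlab
documents = tokenizedDocument( ...
    "Another example... with a URL: https://www.mathworks.com");

% Collect every token type that occurs in the documents.
tdetails = tokenDetails(documents);
allTypes = string(unique(tdetails.Type));

% Erase punctuation from tokens of all of those types.
newDocuments = erasePunctuation(documents,'TokenTypes',allTypes)
```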
