categorical

Array that contains values assigned to categories

Description

categorical is a data type that assigns values to a finite set of discrete categories, such as High, Med, and Low. These categories can have a mathematical ordering that you specify, such as High > Med > Low, but it is not required. A categorical array provides efficient storage and convenient manipulation of nonnumeric data, while also maintaining meaningful names for the values. A common use of categorical arrays is to define groups of rows in a table.

Creation

To create a categorical array:

Use the categorical function as described below.
Bin continuous data using the discretize function. Return the bins as a categorical array.
Multiply two categorical arrays. The product is a categorical array whose categories are all possible combinations of the categories of the two operands.

Syntax

B = categorical(A)

B = categorical(A,valueset)

B = categorical(A,valueset,catnames)

B = categorical(A,___,Name=Value)

Description

B = categorical(A) creates a categorical array from the input array. The categories of the output array are the sorted unique values from the input array.
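As a minimal sketch of this syntax (the fruit names are illustrative values, not taken from the examples on this page), the categories come out sorted, regardless of the order in which the values first appear in the input:

```matlab
% Illustrative input; the values are made up for this sketch.
A = ["pear" "apple" "pear" "fig"];
B = categorical(A);

% categories returns a cell array of character vectors in sorted order.
cats = categories(B);   % {'apple'; 'fig'; 'pear'}
```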


B = categorical(A,valueset) creates one category for each value in valueset. The categories of B are in the same order as the values of valueset.

You can use valueset to include categories for values not present in A. Conversely, if A contains any values not present in valueset, then the corresponding elements of B are undefined.
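The two cases described above can be sketched together. The color names here are illustrative assumptions: "green" appears only in valueset, and "teal" appears only in the input.

```matlab
A = ["red" "blue" "teal"];            % "teal" is not in valueset
valueset = ["blue" "red" "green"];    % "green" is not in A
B = categorical(A,valueset);

tf = isundefined(B);      % [0 0 1], the "teal" element is undefined
cats = categories(B);     % {'blue'; 'red'; 'green'}, in valueset order
```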


B = categorical(A,valueset,catnames) names categories by matching the category values in valueset to the corresponding names in catnames.


B = categorical(A,___,Name=Value) specifies options using one or more name-value arguments in addition to the input arguments in previous syntaxes. For example, to indicate that the categories have a mathematical ordering, set Ordinal to true.


Input Arguments


`A` — Input array
numeric array | logical array | categorical array | datetime array | duration array | string array | cell array of character vectors

Input array, specified as a numeric array, logical array, categorical array, datetime array, duration array, string array, or cell array of character vectors.

The categorical function removes leading and trailing spaces from input values that are strings or character vectors.
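A short sketch of the trimming behavior, using made-up station codes padded with spaces. Because the spaces are removed, all three inputs are expected to land in the same category:

```matlab
% The same code with leading spaces, trailing spaces, and no spaces.
A = ["  S1" "S1  " "S1"];
B = categorical(A);

cats = categories(B);        % {'S1'}
allSame = all(B == "S1");    % true
```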

If the input A contains missing values, then the corresponding elements of the output array are undefined and display as <undefined>. The categorical function converts the following values to undefined categorical values:

NaN in numeric and duration arrays
The missing string (<missing>) or the empty string ("") in string arrays
The empty character vector ('') in cell arrays of character vectors
NaT in datetime arrays
Undefined values (<undefined>) in categorical arrays

The output array does not have a category for undefined values. To create an explicit category for missing or undefined values, you must include the desired category name in catnames, and a missing value as the corresponding value in valueset.
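As a small sketch of the default behavior for a numeric input (the values are illustrative), a NaN becomes an undefined element and no category is created for it:

```matlab
A = [1 NaN 2 1];
B = categorical(A);

tf = isundefined(B);     % [0 1 0 0]
cats = categories(B);    % {'1'; '2'}, no category for NaN
```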

The input A also can be an array of objects with the following class methods:

unique
eq

`valueset` — Categories
`unique(A)` (default) | vector of unique values

Categories, specified as a vector of unique values. The data type of valueset and the data type of the input array must be the same, except when the input is a string array. In that case, valueset can be either a string array or a cell array of character vectors.

The categorical function removes leading and trailing spaces from elements of valueset that are strings or character vectors.

`catnames` — Category names
string array | cell array of character vectors

Category names, specified as a string array or a cell array of character vectors. If you do not specify the catnames input argument, then categorical uses the values in valueset as category names.

The category names cannot include a missing string (<missing>), an empty string (""), or an empty character vector ('').

To merge multiple distinct values from the input array into a single category in the output array, include duplicate names corresponding to those values.
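For instance, the following sketch merges four numeric values into two categories by repeating names in catnames. The names low and high are assumptions chosen for illustration:

```matlab
A = [1 2 3 4 2 1];
valueset = [1 2 3 4];
catnames = ["low" "low" "high" "high"];   % duplicate names merge values
B = categorical(A,valueset,catnames);

cats = categories(B);      % {'low'; 'high'}
nLow = nnz(B == "low");    % 4, the elements that were 1 or 2
```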

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: categorical(A,Ordinal=true) specifies that the categories have a mathematical ordering.
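The two ways of writing a name-value argument can be compared side by side. This is a sketch with an arbitrary input array; the Name=Value form requires R2021a or later, while the quoted form also works in earlier releases:

```matlab
A = [1 2 3 2];
B1 = categorical(A,Ordinal=true);      % R2021a and later
B2 = categorical(A,'Ordinal',true);    % comma-separated, quoted name

same = isequal(B1,B2);                 % true
```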

`Ordinal` — Ordinal variable flag
`0` or `false` (default) | `1` or `true`

Ordinal variable flag, specified as a numeric or logical 0 (false) or 1 (true).

0 (false)

categorical creates a categorical array that is not ordinal, which is the default behavior.

The categories of the output array have no mathematical ordering. Therefore, you can compare the values in the output for equality only. You cannot compare the values using any other relational operator.

1 (true)

categorical creates an ordinal categorical array.

The categories of the output array have a mathematical ordering, such that the first category specified is the smallest and the last category is the largest. You can compare the values in the output using relational operators, such as less than and greater than, in addition to comparing the values for equality. You also can use the min and max functions on an ordinal categorical array.

For more information, see Ordinal Categorical Arrays.
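A brief sketch of what an ordinal array allows, using hypothetical size labels. The order of valueset sets the ordering S < M < L:

```matlab
sizes = categorical(["M" "S" "L" "M"],["S" "M" "L"],Ordinal=true);

smallest = min(sizes);    % S
largest = max(sizes);     % L
tf = sizes >= "M";        % [1 0 1 1]
```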

`Protected` — Protected categories flag
`0` or `false` | `1` or `true`

Protected categories flag, specified as a numeric or logical 0 (false) or 1 (true).

The categories of ordinal categorical arrays are always protected. If you set Ordinal to true, then the default value of Protected is also true. Otherwise, the default value of Protected is false.

`0` (`false`)	When you assign new values to the output array, the categories update automatically. Therefore, you can combine (nonordinal) categorical arrays that have different categories. The categories update to include the categories from both arrays.
`1` (`true`)	When you assign new values to the output array, the values must belong to one of the existing categories. Therefore, you can only combine arrays that have the same categories. To add new categories to the output, you must use the function `addcats`.
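The difference between the two settings can be sketched as follows. The category names a, b, and c and the variable errored are assumptions for illustration. The try block is expected to take the catch path because the array is protected:

```matlab
% Not protected (default for nonordinal arrays): assignment adds a category.
C = categorical(["a" "b"]);
C(3) = "c";
catsC = categories(C);       % {'a'; 'b'; 'c'}

% Protected: assigning a value outside the existing categories errors.
P = categorical(["a" "b"],Protected=true);
try
    P(3) = "c";
    errored = false;
catch
    errored = true;          % expected path
end

% Add the category explicitly, then the assignment succeeds.
P = addcats(P,"c");
P(3) = "c";
catsP = categories(P);       % {'a'; 'b'; 'c'}
```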

Examples


Create Categorical Array and Analyze Data by Category


Create a categorical array from a list of weather station codes. Then add it to a table of temperature readings. Use the categorical array to help you analyze the data in the table by category.

First, create an array of weather station codes.

Stations = ["S1" "S2" "S1" "S3" "S2"]

Stations = 1x5 string
    "S1"    "S2"    "S1"    "S3"    "S2"

To create a categorical array from the weather station codes, use the categorical function.

Stations = categorical(Stations)

Stations = 1x5 categorical
     S1      S2      S1      S3      S2

Display the categories. The three station codes are the categories.

categories(Stations)

ans = 3x1 cell
    {'S1'}
    {'S2'}
    {'S3'}

Now create a table that contains weather data. The table has temperatures, dates, and station codes.

Temperatures = [58;72;56;90;76];
Dates = datetime(["2017-04-17";"2017-04-18";"2017-04-30";"2017-05-01";"2017-04-27"]);
Stations = Stations';
tempReadings = table(Temperatures,Dates,Stations)

tempReadings=5×3 table
    Temperatures       Dates       Stations
    ____________    ___________    ________

         58         17-Apr-2017       S1   
         72         18-Apr-2017       S2   
         56         30-Apr-2017       S1   
         90         01-May-2017       S3   
         76         27-Apr-2017       S2

Categorize the data in the table by weather station. For example, return table rows that have data for station S2. Index into the table using an array of logical indices where Stations equals S2.

TF = (tempReadings.Stations == "S2")

TF = 5x1 logical array

   0
   1
   0
   0
   1

tempReadings(TF,:)

ans=2×3 table
    Temperatures       Dates       Stations
    ____________    ___________    ________

         72         18-Apr-2017       S2   
         76         27-Apr-2017       S2

To find patterns in the data associated with weather stations, make a scatter plot of temperature readings by station.

scatter(tempReadings,"Stations","Temperatures","filled")

Figure contains an axes object. The axes object with xlabel Stations, ylabel Temperatures contains an object of type scatter.

Specify Categories Not Present in Input Array


Convert a string array to a categorical array. Specify that the categorical array has a set of categories that includes a value that is not present in the original array.

First, create a string array that has a set of repeated values.

A = ["red" "blue" "blue" "blue" "blue" "red"]

A = 1x6 string
    "red"    "blue"    "blue"    "blue"    "blue"    "red"

Convert the string array to a categorical array. Specify its categories. Include green as a category.

valueset = ["blue" "red" "green"];
B = categorical(A,valueset)

B = 1x6 categorical
     red      blue      blue      blue      blue      red

Display the categories of the categorical array. It has a category that did not come from the input string array.

categories(B)

ans = 3x1 cell
    {'blue' }
    {'red'  }
    {'green'}

Specify Category Names


Create a numeric array.

A = [1 3 2; 2 1 3; 3 1 2]

A = 3×3

     1     3     2
     2     1     3
     3     1     2

Convert the numeric array to a categorical array. Specify the values and the names for the categories.

B = categorical(A,[1 2 3],["red" "green" "blue"])

B = 3x3 categorical
     red        blue      green 
     green      red       blue  
     blue       red       green

Display the categories.

categories(B)

ans = 3x1 cell
    {'red'  }
    {'green'}
    {'blue' }

B is not an ordinal categorical array. Therefore, you can compare the values in B only using the equality operators, == and ~=.

Find the elements that belong to the category red. Access those elements using logical indexing.

TF = (B == "red")

TF = 3x3 logical array

   1   0   0
   0   1   0
   0   1   0

B(TF)

ans = 3x1 categorical
     red 
     red 
     red

Specify Category for Missing and Empty Inputs


By default, the categorical function converts missing values (such as NaNs, NaTs, empty strings, and missing strings) into undefined categorical values. However, when you call categorical you can specify a category for missing values to belong to.

For example, create a string array that includes an empty string and a missing string.

A = ["hi" "lo" missing "" "lo" "lo" "hi"]

A = 1x7 string
    "hi"    "lo"    <missing>    ""    "lo"    "lo"    "hi"

First, convert the string array to a categorical array with undefined elements.

C = categorical(A)

C = 1x7 categorical
     hi      lo      <undefined>      <undefined>      lo      lo      hi

categories(C)

ans = 2x1 cell
    {'hi'}
    {'lo'}

Then, convert it again. But this time specify INDEF as the category for missing strings.

C = categorical(A,["lo" "hi" missing],["lo" "hi" "INDEF"])

C = 1x7 categorical
     hi      lo      INDEF      <undefined>      lo      lo      hi

categories(C)

ans = 3x1 cell
    {'lo'   }
    {'hi'   }
    {'INDEF'}

Specify INDEF as the category for both missing and empty strings.

C = categorical(A,["lo" "hi" missing ""],["lo" "hi" "INDEF" "INDEF"])

C = 1x7 categorical
     hi      lo      INDEF      INDEF      lo      lo      hi

categories(C)

ans = 3x1 cell
    {'lo'   }
    {'hi'   }
    {'INDEF'}

Create Ordinal Categorical Array


Create a 5-by-2 numeric array.

A = [3 2;3 3;3 2;2 1;3 2];

Convert A to an ordinal categorical array where 1, 2, and 3 represent the categories child, adult, and senior respectively.

valueset = [1 2 3];
catnames = ["child" "adult" "senior"];
B = categorical(A,valueset,catnames,Ordinal=true)

B = 5x2 categorical
     senior      adult  
     senior      senior 
     senior      adult  
     adult       child  
     senior      adult

Because B is ordinal, the categories of B have a mathematical ordering, child < adult < senior. You can use all relational operators with ordinal categorical values. For example, return the elements that have a value greater than adult.

TF = B > "adult"

TF = 5x2 logical array

   1   0
   1   1
   1   0
   0   0
   1   0

B(TF)

ans = 5x1 categorical
     senior 
     senior 
     senior 
     senior 
     senior

Preallocate Array and Initialize Categories


You can preallocate a categorical array of any size by creating an array of NaNs and converting it to a categorical array. After you preallocate the array, you can initialize its categories by specifying category names and adding the categories to the array.

First create an array of NaNs. You can create an array having any size. For example, create a 2-by-4 array of NaNs.

A = NaN(2,4)

A = 2×4

   NaN   NaN   NaN   NaN
   NaN   NaN   NaN   NaN

Then preallocate a categorical array by converting the array of NaNs. The categorical function converts NaNs to undefined categorical values. Just as a NaN represents "not a number", <undefined> represents a categorical value that does not belong to a category.

A = categorical(A)

A = 2x4 categorical
     <undefined>      <undefined>      <undefined>      <undefined> 
     <undefined>      <undefined>      <undefined>      <undefined>

In fact, at this point A has no categories.

categories(A)

ans =

  0x0 empty cell array

To initialize the categories of A, specify category names and add them to A by using the addcats function. For example, add small, medium, and large as three categories of A.

A = addcats(A,["small" "medium" "large"])

A = 2x4 categorical
     <undefined>      <undefined>      <undefined>      <undefined> 
     <undefined>      <undefined>      <undefined>      <undefined>

While the elements of A are undefined values, the categories have been initialized by addcats.

categories(A)

ans = 3x1 cell
    {'small' }
    {'medium'}
    {'large' }

Now that A has categories, you can assign defined categorical values as elements of A.

A(1) = "medium";
A(8) = "small";
A(3:5) = "large"

A = 2x4 categorical
     medium           large      large            <undefined> 
     <undefined>      large      <undefined>      small

Bin Continuous Numeric Data into Categories


The discretize function is recommended for creating categories out of continuous data, particularly when there are input values that are closely spaced. Two values are closely spaced when the difference between them is less than about 5e-5. When values are closely spaced, the categorical function cannot create unique category names from the values.

Create a numeric array with 100 random numbers.

X = rand(100,1);

To bin the numbers into three categories, use discretize. Specify bin boundaries and category names for the bins.

C = discretize(X,[0 .25 .75 1],"categorical",["small" "medium" "large"])

C = 100x1 categorical
     large 
     large 
     small 
     large 
     medium 
     small 
     medium 
     medium 
     large 
     large 
     small 
     large 
     large 
     medium 
     large 
     small 
     medium 
     large 
     large 
     large 
     medium 
     small 
     large 
     large 
     medium 
     large 
     medium 
     medium 
     medium 
     small 
      ⋮

Plot a histogram of the three categories of data.

histogram(C)

Figure contains an axes object. The axes object contains an object of type categoricalhistogram.

Produce All Combinations of Categories


When you multiply two categorical arrays, the result is a categorical array with a set of new categories. The new categories are all the ordered pairs created from the categories of the two original categorical arrays. This set of all possible combinations of categories is also known as the Cartesian product of the two original sets of categories.

For example, create two categorical arrays. These arrays list blood groups and Rh factors for six patients.

bloodGroups = categorical(["A" "AB" "O" "O" "A" "A"], ...
                          ["A" "B" "AB" "O"])

bloodGroups = 1x6 categorical
     A      AB      O      O      A      A

Rhfactors = categorical(["+" "+" "-" "-" "+" "+"])

Rhfactors = 1x6 categorical
     +      +      -      -      +      +

Display the categories of the two arrays. While the two categorical arrays have the same numbers of elements, they can have different numbers of categories.

categories(bloodGroups)

ans = 4x1 cell
    {'A' }
    {'B' }
    {'AB'}
    {'O' }

categories(Rhfactors)

ans = 2x1 cell
    {'+'}
    {'-'}

Multiply the two categorical arrays. The elements of the product come from combinations of the corresponding elements from the input arrays.

bloodTypes = bloodGroups .* Rhfactors

bloodTypes = 1x6 categorical
     A +      AB +      O -      O -      A +      A +

However, the categories of the product are all the ordered pairs that can be created from the categories of the two arrays. So, it is possible that some categories are not represented by any element of the output array.

categories(bloodTypes)

ans = 8x1 cell
    {'A +' }
    {'A -' }
    {'B +' }
    {'B -' }
    {'AB +'}
    {'AB -'}
    {'O +' }
    {'O -' }

Limitations

If the input array is a numeric, datetime, or duration array, and you create category names from the values in the input, then categorical rounds them off to five significant figures.
For example, categorical([1 1.23456789]) creates category names 1 and 1.2346 from these two values. To create categories from continuous numeric, duration, or datetime data, use the discretize function.
If the input array has numeric, datetime, or duration values that are too closely spaced, then categorical cannot create category names from those values. In general, such values are too closely spaced if the difference between any two values in the input is less than about 5e-5.
For example, categorical([1 1.00001]) cannot create category names from the two numeric values because the difference between them is too small. To create categories from continuous numeric, duration, or datetime data, use the discretize function.
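The rounding limitation and the suggested alternative can be sketched together. The input X, the bin edges, and the bin names below are illustrative assumptions; the category names in the first comment are the ones stated in the limitation above:

```matlab
% Category names are rounded to five significant figures.
B = categorical([1 1.23456789]);
cats = categories(B);    % {'1'; '1.2346'}

% discretize bins the values instead of naming each distinct value.
X = [0.1 0.5 0.9];       % illustrative data
C = discretize(X,[0 .25 .75 1],"categorical",["small" "medium" "large"]);
```

With binning, category names come from the names you supply rather than from the numeric values, so closely spaced inputs do not need distinct names.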

Tips

For a list of functions that accept or return categorical arrays, see Categorical Arrays.

Extended Capabilities

Tall Arrays
Calculate with arrays that have more rows than fit in memory.

The categorical function supports tall arrays with the following usage notes and limitations:

If the list of categories is known, a best practice is to provide the categories when you create the tall categorical array using categorical(A,valueset). If the categories are not provided, then many calculations require MATLAB® to perform an extra pass through the data to determine the categories.

For more information, see Tall Arrays.

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

Starting in R2019a, you can use categorical arrays in MATLAB code intended for code generation. For more information, see Code Generation for Categorical Arrays (MATLAB Coder) and Categorical Array Limitations for Code Generation (MATLAB Coder).

Thread-Based Environment
Run code in the background using MATLAB® `backgroundPool` or accelerate code with Parallel Computing Toolbox™ `ThreadPool`.

Distributed Arrays
Partition large arrays across the combined memory of your cluster using Parallel Computing Toolbox™.

Usage notes and limitations:

For the one-input syntax B = categorical(A), the order of the categories is undefined. To enforce the order, use valueset and catnames.

For more information, see Run MATLAB Functions with Distributed Arrays (Parallel Computing Toolbox).

Version History

Introduced in R2013b
