
categorical

Array that contains values assigned to categories

Description

categorical is a data type that assigns values to a finite set of discrete categories, such as High, Med, and Low. These categories can have a mathematical ordering that you specify, such as High > Med > Low, but it is not required. A categorical array provides efficient storage and convenient manipulation of nonnumeric data, while also maintaining meaningful names for the values. A common use of categorical arrays is to define groups of rows in a table.

Creation

To create a categorical array:

  • Use the categorical function as described below.

  • Bin continuous data using the discretize function. Return the bins as a categorical array.

  • Multiply two categorical arrays. The product is a categorical array whose categories are all possible combinations of the categories of the two operands.

Description

B = categorical(A) creates a categorical array from the input array. The categories of the output array are the sorted unique values from the input array.
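As a minimal sketch of this default behavior (the variable names fruit and cats are illustrative), the categories come out in sorted order regardless of the order of the input values:

```matlab
% Input values are deliberately unsorted and contain a repeat.
fruit = ["pear" "apple" "pear" "fig"];
B = categorical(fruit);

% The categories are the sorted unique values of the input.
cats = categories(B);
disp(cats)
```

The elements of B keep the order of the input. Only the category list is sorted.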


B = categorical(A,valueset) creates one category for each value in valueset. The categories of B are in the same order as the values of valueset.

You can use valueset to include categories for values not present in A. Conversely, if A contains any values not present in valueset, then the corresponding elements of B are undefined.
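The following sketch illustrates both cases with made-up numeric data: the value 3 appears in valueset but not in A, and the value 5 appears in A but not in valueset. The isundefined function is used here to locate the undefined element.

```matlab
A = [1 2 5 2];
valueset = [1 2 3];            % 3 is not in A; 5 is not in valueset
B = categorical(A,valueset);

cats = categories(B);          % one category per value in valueset
TF = isundefined(B);           % true only where A was 5
disp(cats)
disp(TF)
```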


B = categorical(A,valueset,catnames) names categories by matching the category values in valueset to the corresponding names in catnames.


B = categorical(A,___,Name=Value) specifies options using one or more name-value arguments in addition to the input arguments in previous syntaxes. For example, to indicate that the categories have a mathematical ordering, set Ordinal to true.


Input Arguments


A: Input array, specified as a numeric array, logical array, categorical array, datetime array, duration array, string array, or cell array of character vectors.

The categorical function removes leading and trailing spaces from input values that are strings or character vectors.
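A short sketch of the trimming behavior, using illustrative input values that differ only in their surrounding spaces:

```matlab
% The three inputs differ only in leading or trailing spaces.
A = ["  north" "north  " "north"];
B = categorical(A);

% After trimming, all three values belong to the same category.
cats = categories(B);
numCats = numel(cats);
disp(numCats)
```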

If the input A contains missing values, then the corresponding elements of the output array are undefined and display as <undefined>. The categorical function converts the following values to undefined categorical values:

  • NaN in numeric and duration arrays

  • The missing string (<missing>) or the empty string ("") in string arrays

  • The empty character vector ('') in cell arrays of character vectors

  • NaT in datetime arrays

  • Undefined values (<undefined>) in categorical arrays

The output array does not have a category for undefined values. To create an explicit category for missing or undefined values, you must include the desired category name in catnames, and a missing value as the corresponding value in valueset.

The input A also can be an array of objects with the following class methods:

  • unique

  • eq

valueset: Categories, specified as a vector of unique values. The data type of valueset and the data type of the input array must be the same, except when the input is a string array. In that case, valueset can be either a string array or a cell array of character vectors.

The categorical function removes leading and trailing spaces from elements of valueset that are strings or character vectors.

catnames: Category names, specified as a string array or a cell array of character vectors. If you do not specify the catnames input argument, then categorical uses the values in valueset as category names.

The category names cannot include a missing string (<missing>), an empty string (""), or an empty character vector ('').

To merge multiple distinct values from the input array into a single category in the output array, include duplicate names corresponding to those values.
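For instance, the sketch below (with illustrative values and names) maps four numeric values onto two categories by repeating each name in catnames:

```matlab
A = [1 2 3 4 2 4];
valueset = [1 2 3 4];
catnames = ["low" "low" "high" "high"];   % duplicate names merge values

B = categorical(A,valueset,catnames);
cats = categories(B);                      % only two categories remain
disp(B)
disp(cats)
```

Values 1 and 2 both become low, and values 3 and 4 both become high.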

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: categorical(A,Ordinal=true) specifies that the categories have a mathematical ordering.
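The two calling forms can be compared side by side. This is a sketch with illustrative data, and the Name=Value form requires R2021a or later.

```matlab
A = ["Low" "High" "Med"];
valueset = ["Low" "Med" "High"];

% Name=Value syntax (R2021a and later).
B1 = categorical(A,valueset,Ordinal=true);

% Comma-separated syntax with the name in quotes (also valid before R2021a).
B2 = categorical(A,valueset,'Ordinal',true);

same = isequal(B1,B2);
disp(same)
```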

Ordinal: Ordinal variable flag, specified as a numeric or logical 0 (false) or 1 (true).

0 (false)

categorical creates a categorical array that is not ordinal, which is the default behavior.

The categories of the output array have no mathematical ordering. Therefore, you can compare the values in the output for equality only. You cannot compare the values using any other relational operator.

1 (true)

categorical creates an ordinal categorical array.

The categories of the output array have a mathematical ordering, such that the first category specified is the smallest and the last category is the largest. You can compare the values in the output using relational operators, such as less than and greater than, in addition to comparing the values for equality. You also can use the min and max functions on an ordinal categorical array.

For more information, see Ordinal Categorical Arrays.
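As a brief sketch of what the ordering allows (the array name sizes and its values are illustrative), an ordinal array supports relational operators as well as min and max:

```matlab
% Category order is Low < Med < High, as given by valueset.
sizes = categorical(["Med" "Low" "High" "Med"], ...
                    ["Low" "Med" "High"],Ordinal=true);

smallest = min(sizes);     % first category in the ordering that occurs
largest = max(sizes);      % last category in the ordering that occurs
TF = sizes >= "Med";       % relational comparison against a category

disp(smallest)
disp(largest)
disp(TF)
```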

Protected: Protected categories flag, specified as a numeric or logical 0 (false) or 1 (true).

The categories of ordinal categorical arrays are always protected. If you set Ordinal to true, then the default value of Protected is also true. Otherwise, the default value of Protected is false.

0 (false)

When you assign new values to the output array, the categories update automatically. Therefore, you can combine (nonordinal) categorical arrays that have different categories. The categories of the result update to include the categories from both arrays.

1 (true)

When you assign new values to the output array, the values must belong to one of the existing categories. Therefore, you can only combine arrays that have the same categories. To add new categories to the output, you must use the function addcats.
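The difference between the two settings can be sketched as follows. The variable names U, P, and assignmentFailed are illustrative, and the try/catch block is only there so that the script continues after the expected error.

```matlab
% Unprotected (the default for nonordinal arrays):
% assigning a new value adds a category automatically.
U = categorical(["a" "b"]);
U(3) = "c";
disp(categories(U))

% Protected: the same assignment raises an error.
P = categorical(["a" "b"],Protected=true);
try
    P(3) = "c";
    assignmentFailed = false;
catch
    assignmentFailed = true;
end
disp(assignmentFailed)

% Add the category explicitly, after which the assignment succeeds.
P = addcats(P,"c");
P(3) = "c";
disp(P)
```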

Examples


Create a categorical array from a list of weather station codes. Then add it to a table of temperature readings. Use the categorical array to help you analyze the data in the table by category.

First, create an array of weather station codes.

Stations = ["S1" "S2" "S1" "S3" "S2"]
Stations = 1x5 string
    "S1"    "S2"    "S1"    "S3"    "S2"

To create a categorical array from the weather station codes, use the categorical function.

Stations = categorical(Stations)
Stations = 1x5 categorical
     S1      S2      S1      S3      S2 

Display the categories. The three station codes are the categories.

categories(Stations)
ans = 3x1 cell
    {'S1'}
    {'S2'}
    {'S3'}

Now create a table that contains weather data. The table has temperatures, dates, and station codes.

Temperatures = [58;72;56;90;76];
Dates = datetime(["2017-04-17";"2017-04-18";"2017-04-30";"2017-05-01";"2017-04-27"]);
Stations = Stations';
tempReadings = table(Temperatures,Dates,Stations)
tempReadings=5×3 table
    Temperatures       Dates       Stations
    ____________    ___________    ________

         58         17-Apr-2017       S1   
         72         18-Apr-2017       S2   
         56         30-Apr-2017       S1   
         90         01-May-2017       S3   
         76         27-Apr-2017       S2   

Categorize the data in the table by weather station. For example, return table rows that have data for station S2. Index into the table using an array of logical indices where Stations equals S2.

TF = (tempReadings.Stations == "S2")
TF = 5x1 logical array

   0
   1
   0
   0
   1

tempReadings(TF,:)
ans=2×3 table
    Temperatures       Dates       Stations
    ____________    ___________    ________

         72         18-Apr-2017       S2   
         76         27-Apr-2017       S2   
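Logical indexing selects one category at a time. To summarize all categories at once, one option is the groupsummary function, which is not used elsewhere on this page. The sketch below rebuilds a reduced version of the table (without the Dates variable) so that it runs on its own, and computes the mean temperature for each station.

```matlab
% Rebuild a reduced version of the table from the example.
Stations = categorical(["S1";"S2";"S1";"S3";"S2"]);
Temperatures = [58;72;56;90;76];
tempReadings = table(Temperatures,Stations);

% Group rows by the categorical variable and average the temperatures.
stationMeans = groupsummary(tempReadings,"Stations","mean","Temperatures");
disp(stationMeans)
```

The output table has one row per category of Stations, with the group counts in GroupCount and the means in mean_Temperatures.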

To find patterns in the data associated with weather stations, make a scatter plot of temperature readings by station.

scatter(tempReadings,"Stations","Temperatures","filled")

[Figure: scatter plot of Temperatures (y-axis) against Stations (x-axis).]

Convert a string array to a categorical array. Specify that the categorical array has a set of categories that includes a value that is not present in the original array.

First, create a string array that has a set of repeated values.

A = ["red" "blue" "blue" "blue" "blue" "red"]
A = 1x6 string
    "red"    "blue"    "blue"    "blue"    "blue"    "red"

Convert the string array to a categorical array. Specify its categories. Include green as a category.

valueset = ["blue" "red" "green"];
B = categorical(A,valueset)
B = 1x6 categorical
     red      blue      blue      blue      blue      red 

Display the categories of the categorical array. It has a category that did not come from the input string array.

categories(B)
ans = 3x1 cell
    {'blue' }
    {'red'  }
    {'green'}

Create a numeric array.

A = [1 3 2; 2 1 3; 3 1 2]
A = 3×3

     1     3     2
     2     1     3
     3     1     2

Convert the numeric array to a categorical array. Specify the values and the names for the categories.

B = categorical(A,[1 2 3],["red" "green" "blue"])
B = 3x3 categorical
     red        blue      green 
     green      red       blue  
     blue       red       green 

Display the categories.

categories(B)
ans = 3x1 cell
    {'red'  }
    {'green'}
    {'blue' }

B is not an ordinal categorical array. Therefore, you can compare the values in B only using the equality operators, == and ~=.

Find the elements that belong to the category red. Access those elements using logical indexing.

TF = (B == "red")
TF = 3x3 logical array

   1   0   0
   0   1   0
   0   1   0

B(TF)
ans = 3x1 categorical
     red 
     red 
     red 

By default, the categorical function converts missing values (such as NaNs, NaTs, empty strings, and missing strings) into undefined categorical values. However, when you call categorical you can specify a category for missing values to belong to.

For example, create a string array that includes an empty string and a missing string.

A = ["hi" "lo" missing "" "lo" "lo" "hi"]
A = 1x7 string
    "hi"    "lo"    <missing>    ""    "lo"    "lo"    "hi"

First, convert the string array to a categorical array with undefined elements.

C = categorical(A)
C = 1x7 categorical
     hi      lo      <undefined>      <undefined>      lo      lo      hi 

categories(C)
ans = 2x1 cell
    {'hi'}
    {'lo'}

Then convert it again, this time specifying INDEF as the category for missing strings.

C = categorical(A,["lo" "hi" missing],["lo" "hi" "INDEF"])
C = 1x7 categorical
     hi      lo      INDEF      <undefined>      lo      lo      hi 

categories(C)
ans = 3x1 cell
    {'lo'   }
    {'hi'   }
    {'INDEF'}

Specify INDEF as the category for both missing and empty strings.

C = categorical(A,["lo" "hi" missing ""],["lo" "hi" "INDEF" "INDEF"])
C = 1x7 categorical
     hi      lo      INDEF      INDEF      lo      lo      hi 

categories(C)
ans = 3x1 cell
    {'lo'   }
    {'hi'   }
    {'INDEF'}

Create a 5-by-2 numeric array.

A = [3 2;3 3;3 2;2 1;3 2]
A = 5×2

     3     2
     3     3
     3     2
     2     1
     3     2

Convert A to an ordinal categorical array where 1, 2, and 3 represent the categories child, adult, and senior respectively.

valueset = [1 2 3];
catnames = ["child" "adult" "senior"];
B = categorical(A,valueset,catnames,Ordinal=true)
B = 5x2 categorical
     senior      adult  
     senior      senior 
     senior      adult  
     adult       child  
     senior      adult  

Because B is ordinal, the categories of B have a mathematical ordering, child < adult < senior. You can use all relational operators with ordinal categorical values. For example, return the elements that have a value greater than adult.

TF = B > "adult"
TF = 5x2 logical array

   1   0
   1   1
   1   0
   0   0
   1   0

B(TF)
ans = 5x1 categorical
     senior 
     senior 
     senior 
     senior 
     senior 

You can preallocate a categorical array of any size by creating an array of NaNs and converting it to a categorical array. After you preallocate the array, you can initialize its categories by specifying category names and adding the categories to the array.

First create an array of NaNs. You can create an array having any size. For example, create a 2-by-4 array of NaNs.

A = NaN(2,4)
A = 2×4

   NaN   NaN   NaN   NaN
   NaN   NaN   NaN   NaN

Then preallocate a categorical array by converting the array of NaNs. The categorical function converts NaNs to undefined categorical values. Just as a NaN represents "not a number", <undefined> represents a categorical value that does not belong to a category.

A = categorical(A)
A = 2x4 categorical
     <undefined>      <undefined>      <undefined>      <undefined> 
     <undefined>      <undefined>      <undefined>      <undefined> 

In fact, at this point A has no categories.

categories(A)
ans =

  0x0 empty cell array

To initialize the categories of A, specify category names and add them to A by using the addcats function. For example, add small, medium, and large as three categories of A.

A = addcats(A,["small" "medium" "large"])
A = 2x4 categorical
     <undefined>      <undefined>      <undefined>      <undefined> 
     <undefined>      <undefined>      <undefined>      <undefined> 

Although the elements of A are still undefined, addcats has initialized the categories.

categories(A)
ans = 3x1 cell
    {'small' }
    {'medium'}
    {'large' }

Now that A has categories, you can assign defined categorical values as elements of A.

A(1) = "medium";
A(8) = "small";
A(3:5) = "large"
A = 2x4 categorical
     medium           large      large            <undefined> 
     <undefined>      large      <undefined>      small       

The discretize function is recommended for creating categories out of continuous data, particularly when there are input values that are closely spaced. Two values are closely spaced when the difference between them is less than about 5e-5. When values are closely spaced, the categorical function cannot create unique category names from the values.

Create a numeric array with 100 random numbers.

X = rand(100,1)
X = 100×1

    0.8147
    0.9058
    0.1270
    0.9134
    0.6324
    0.0975
    0.2785
    0.5469
    0.9575
    0.9649
      ⋮

To bin the numbers into three categories, use discretize. Specify bin boundaries and category names for the bins.

C = discretize(X,[0 .25 .75 1],"categorical",["small" "medium" "large"])
C = 100x1 categorical
     large 
     large 
     small 
     large 
     medium 
     small 
     medium 
     medium 
     large 
     large 
     small 
     large 
     large 
     medium 
     large 
     small 
     medium 
     large 
     large 
     large 
     medium 
     small 
     large 
     large 
     medium 
     large 
     medium 
     medium 
     medium 
     small 
      ⋮

Plot a histogram of the three categories of data.

histogram(C)

[Figure: categorical histogram of C.]

When you multiply two categorical arrays, the result is a categorical array with a set of new categories. The new categories are all the ordered pairs created from the categories of the two original categorical arrays. This set of all possible combinations of categories is also known as the Cartesian product of the two original sets of categories.

For example, create two categorical arrays. These arrays list blood groups and Rh factors for six patients.

bloodGroups = categorical(["A" "AB" "O" "O" "A" "A"], ...
                          ["A" "B" "AB" "O"])
bloodGroups = 1x6 categorical
     A      AB      O      O      A      A 

Rhfactors = categorical(["+" "+" "-" "-" "+" "+"])
Rhfactors = 1x6 categorical
     +      +      -      -      +      + 

Display the categories of the two arrays. While the two categorical arrays have the same number of elements, they can have different numbers of categories.

categories(bloodGroups)
ans = 4x1 cell
    {'A' }
    {'B' }
    {'AB'}
    {'O' }

categories(Rhfactors)
ans = 2x1 cell
    {'+'}
    {'-'}

Multiply the two categorical arrays. The elements of the product come from combinations of the corresponding elements from the input arrays.

bloodTypes = bloodGroups .* Rhfactors
bloodTypes = 1x6 categorical
     A +      AB +      O -      O -      A +      A + 

However, the categories of the product are all the ordered pairs that can be created from the categories of the two arrays. So, it is possible that some categories are not represented by any element of the output array.

categories(bloodTypes)
ans = 8x1 cell
    {'A +' }
    {'A -' }
    {'B +' }
    {'B -' }
    {'AB +'}
    {'AB -'}
    {'O +' }
    {'O -' }
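If the unused categories are not wanted, one option is to drop them with the removecats function, which is not otherwise covered on this page. The sketch below repeats the setup so that it runs on its own. The name usedTypes is illustrative.

```matlab
bloodGroups = categorical(["A" "AB" "O" "O" "A" "A"], ...
                          ["A" "B" "AB" "O"]);
Rhfactors = categorical(["+" "+" "-" "-" "+" "+"]);
bloodTypes = bloodGroups .* Rhfactors;

% Called with one input, removecats drops every category
% that no element of the array belongs to.
usedTypes = removecats(bloodTypes);
disp(categories(usedTypes))
```

Only the categories that occur in the data remain, in their original relative order.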

Limitations

  • If the input array is a numeric, datetime, or duration array, and you create category names from the values in the input, then categorical rounds them off to five significant figures.

    For example, categorical([1 1.23456789]) creates category names 1 and 1.2346 from these two values. To create categories from continuous numeric, duration, or datetime data, use the discretize function.

  • If the input array has numeric, datetime, or duration values that are too closely spaced, then categorical cannot create category names from those values. In general, such values are too closely spaced if the difference between any two values in the input is less than about 5e-5.

    For example, categorical([1 1.00001]) cannot create category names from the two numeric values because the difference between them is too small. To create categories from continuous numeric, duration, or datetime data, use the discretize function.
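Both limitations can be sketched briefly. The first call reproduces the rounding described above. The second part uses discretize with bin edges and category names chosen purely for illustration, so that each closely spaced value falls into its own bin.

```matlab
% Category names created from numeric values are rounded.
rounded = categories(categorical([1 1.23456789]));
disp(rounded)

% Closely spaced values: bin them explicitly instead.
X = [1 1.00001 1.00002];
edges = [0.999995 1.000005 1.000015 1.000025];   % illustrative bin edges
C = discretize(X,edges,"categorical",["first" "second" "third"]);
disp(C)
```

Because the category names are supplied explicitly, discretize does not need to derive names from the values themselves.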


Extended Capabilities

Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.

Version History

Introduced in R2013b