Main Content

Grouped Statistics Calculations with Tall Arrays

This example shows how to calculate grouped statistics of a tall timetable containing power outage data. The example uses the grouptransform, groupsummary, and groupcounts functions to calculate various quantities of interest, such as the most common power outage cause in each region. Even though the raw data in this example only has about 1500 rows, you can use the techniques presented here on much larger data sets because no assumptions are made about the size of the data.

Create Datastore and Tall Timetable

The sample file, outages.csv, contains data representing electric utility outages in the United States. The file contains six columns: Region, OutageTime, Loss, Customers, RestorationTime, and Cause.

Create a datastore for the outages.csv file. Use the "TextScanFormats" option to specify the kind of data each column contains: categorical ("%C"), floating-point numeric ("%f"), or datetime ("%D").

data_formats = ["%C","%D","%f","%f","%D","%C"];
ds = tabularTextDatastore("outages.csv","TextscanFormats",data_formats);

Create a tall table on top of the datastore, and convert the tall table into a tall timetable. The OutageTime variable is used for the row times since it is the first datetime or duration variable in the table.

T = tall(ds);
T = table2timetable(T)
T =

  Mx5 tall timetable

    OutageTime    Region    Loss    Customers    RestorationTime    Cause
    __________    ______    ____    _________    _______________    _____

        ?           ?        ?          ?               ?             ?  
        ?           ?        ?          ?               ?             ?  
        ?           ?        ?          ?               ?             ?  
        :           :        :          :               :             :
        :           :        :          :               :             :

Replace Missing Data

Some of the rows in the RestorationTime variable have missing times, represented by NaT values. Remove these rows from the table.

T = rmmissing(T,"DataVariables","RestorationTime");

For the numeric variables in the timetable, instead of removing rows with missing values, replace the missing values with the mean value for each region.

T = grouptransform(T,"Region","meanfill",["Loss","Customers"]);

Use ismissing to confirm that no pieces of missing data remain in the table.

tf = any(ismissing(T),"all");
gather(tf)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 4: Completed in 0.3 sec
- Pass 2 of 4: Completed in 0.48 sec
- Pass 3 of 4: Completed in 0.37 sec
- Pass 4 of 4: Completed in 0.33 sec
Evaluation completed in 2.2 sec
ans = logical
   0

Preview Data

Now that the data does not contain missing values, bring a small number of rows into memory to get an idea of what the data contains.

gather(head(T))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.2 sec
Evaluation completed in 0.29 sec
ans=8×5 timetable
       OutageTime        Region       Loss     Customers     RestorationTime          Cause     
    ________________    _________    ______    __________    ________________    _______________

    2002-02-01 12:18    SouthWest    458.98    1.8202e+06    2002-02-07 16:50    winter storm   
    2003-02-07 21:15    SouthEast     289.4    1.4294e+05    2003-02-17 08:14    winter storm   
    2004-04-06 05:44    West         434.81    3.4037e+05    2004-04-06 06:10    equipment fault
    2002-03-16 06:18    MidWest      186.44    2.1275e+05    2002-03-18 23:23    severe storm   
    2003-06-18 02:49    West              0             0    2003-06-18 10:54    attack         
    2004-06-20 14:39    West         231.29    1.5354e+05    2004-06-20 19:16    equipment fault
    2002-06-06 19:28    West         311.86    1.5354e+05    2002-06-07 00:51    equipment fault
    2003-07-16 16:23    NorthEast    239.93         49434    2003-07-17 01:12    fire           

Mean Power Outage Duration by Region

Determine the mean power outage duration in each region using groupsummary. First, create a new variable OutageDuration in the table that contains the duration of each outage, found by subtracting the outage time from the restoration time. In the call to groupsummary, specify:

  • "Region" as the grouping variable

  • "mean" as the computation method

  • "OutageDuration" as the variable to operate on.

T.OutageDuration = T.RestorationTime - T.OutageTime;
times = groupsummary(T,"Region","mean","OutageDuration")
times =

  Mx3 tall table

    Region    GroupCount    mean_OutageDuration
    ______    __________    ___________________

      ?           ?                  ?         
      ?           ?                  ?         
      ?           ?                  ?         
      :           :                  :
      :           :                  :

Change the display format of the duration results to be in days, and gather the results into memory. The results show the mean outage duration in each region, as well as the number of reported outages in each region.

times.mean_OutageDuration.Format = "d";
times = gather(times)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.85 sec
- Pass 2 of 2: Completed in 0.37 sec
Evaluation completed in 1.6 sec
times=5×3 table
     Region      GroupCount    mean_OutageDuration
    _________    __________    ___________________

    MidWest         138            34.135 days    
    NorthEast       548             24.21 days    
    SouthEast       379            1.7013 days    
    SouthWest        25            2.4799 days    
    West            349            28.061 days    

Most Common Power Outage Causes by Region

Determine how often each power outage cause occurs in each region. Use groupcounts with the Cause and Region variables as grouping variables. Gather the results into memory.

causes = groupcounts(T,["Cause","Region"]);
causes = gather(causes)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.3 sec
- Pass 2 of 2: Completed in 0.22 sec
Evaluation completed in 1.3 sec
causes=43×4 table
         Cause           Region      GroupCount    Percent 
    ________________    _________    __________    ________

    attack              MidWest          12         0.83391
    attack              NorthEast       135          9.3815
    attack              SouthEast        19          1.3204
    attack              West            126          8.7561
    earthquake          NorthEast         1        0.069493
    earthquake          West              1        0.069493
    energy emergency    MidWest          19          1.3204
    energy emergency    NorthEast        29          2.0153
    energy emergency    SouthEast        79          5.4899
    energy emergency    SouthWest         7         0.48645
    energy emergency    West             46          3.1967
    equipment fault     MidWest           9         0.62543
    equipment fault     NorthEast        17          1.1814
    equipment fault     SouthEast        40          2.7797
    equipment fault     SouthWest         2         0.13899
    equipment fault     West             85          5.9069
      ⋮

Each cause occurs several times in the table, so even though the table contains the correct data it is not in the proper format to see how often each cause occurs in each region. To improve the presentation of the data, unstack the GroupCount variable so that each column corresponds to a region and each row corresponds to an outage cause.

RegionCauses = gather(unstack(causes,"GroupCount","Region","GroupingVariables","Cause"))
RegionCauses=10×6 table
         Cause          MidWest    NorthEast    SouthEast    SouthWest    West
    ________________    _______    _________    _________    _________    ____

    attack                 12         135           19          NaN       126 
    earthquake            NaN           1          NaN          NaN         1 
    energy emergency       19          29           79            7        46 
    equipment fault         9          17           40            2        85 
    fire                  NaN           5            3          NaN        17 
    severe storm           30         139          132            6        22 
    thunder storm          31         102           54            6         7 
    unknown                 4          10            3          NaN         4 
    wind                   16          40           13            3        22 
    winter storm           17          70           36            1        19 

Not all combinations of outage causes and regions are represented in the data, so the resulting table contains some NaNs. Fill in the NaN values with zeros.

RegionCauses = fillmissing(RegionCauses,"constant",{"",0,0,0,0,0})
RegionCauses=10×6 table
         Cause          MidWest    NorthEast    SouthEast    SouthWest    West
    ________________    _______    _________    _________    _________    ____

    attack                12          135           19           0        126 
    earthquake             0            1            0           0          1 
    energy emergency      19           29           79           7         46 
    equipment fault        9           17           40           2         85 
    fire                   0            5            3           0         17 
    severe storm          30          139          132           6         22 
    thunder storm         31          102           54           6          7 
    unknown                4           10            3           0          4 
    wind                  16           40           13           3         22 
    winter storm          17           70           36           1         19 

Worst Power Outages in Each Region

Calculate the broadest customer impact for each power outage in each region.

WorstOutages = groupsummary(T,["Region","Cause"],"max","Customers");
WorstOutages = gather(WorstOutages)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.91 sec
- Pass 2 of 2: Completed in 0.41 sec
Evaluation completed in 1.6 sec
WorstOutages=43×4 table
     Region           Cause          GroupCount    max_Customers
    _________    ________________    __________    _____________

    MidWest      attack                  12         2.4403e+05  
    MidWest      energy emergency        19         5.0376e+05  
    MidWest      equipment fault          9         2.4403e+05  
    MidWest      severe storm            30          3.972e+06  
    MidWest      thunder storm           31         3.8233e+05  
    MidWest      unknown                  4         3.0879e+06  
    MidWest      wind                    16         2.8666e+05  
    MidWest      winter storm            17         7.7697e+05  
    NorthEast    attack                 135         1.5005e+05  
    NorthEast    earthquake               1                  0  
    NorthEast    energy emergency        29         1.5005e+05  
    NorthEast    equipment fault         17          1.667e+05  
    NorthEast    fire                     5         4.5139e+05  
    NorthEast    severe storm           139         1.0735e+06  
    NorthEast    thunder storm          102         5.9689e+06  
    NorthEast    unknown                 10         2.4983e+06  
      ⋮

Combine the data in the Region and Cause variables into a single categorical variable by briefly converting them into strings. Then, create a categorical histogram of the maximum number of affected customers for each cause in each region.

WorstOutages.RegionCause = categorical(string(WorstOutages.Region)+" ("+string(WorstOutages.Cause)+")");
histogram("Categories",WorstOutages.RegionCause,"BinCounts",WorstOutages.max_Customers,...
    "DisplayOrder","descend")
ylabel("Max # Affected Customers")

Figure contains an axes object. The axes object with ylabel Max # Affected Customers contains an object of type categoricalhistogram.

See Also

| |

Related Topics