Performing an SQL Style “Group By” in Matlab

On a recent project, I found that I needed to group data in one matrix by values in another matrix, but I could not find a nice way of doing this. So, I wrote a simple Matlab function to do this sort of grouping. I am describing it here in case others run into a similar need.

Let me start off by describing some sample data. Assume we have been collecting data on the number of cars that pass a checkpoint by hour for several days. We might end up with data that looks something like this:

Day hour Cars
1 8 47
1 9 38
1 10 35
2 8 52
2 9 42
2 10 37
3 8 43
3 9 34
3 10 45

In SQL, we might do something like:
SELECT day,sum(cars) from data group by day or
SELECT hour,sum(cars) from data group by hour

What I did was create a function, I call ‘bucket’ for gathering the data into buckets.

function [bucket,total] = bucket(buckets, values)

% Create an Array of values and buckets
A = cat(2,values, buckets);

% Sort the array on the buckets

B = sortrows(A,2);
sortedbuckets=B(:,2);

% Get the number of elements in the array
n=size(sortedbuckets,1);

% Set distinctbuckets to be different buckets. Initially, all the same

distinctbuckets(1:n) = 1;

% Then, compare the elements of the buckets with the values immediately following

distinctbuckets(2:n) = ( sortedbuckets(1:n-1) ~= sortedbuckets(2:n) );

% cumsum does a cumulative sum. As long as the bucket values are the same, the value
% in distinctbuckets will be zero and the resulting uniquebucket stays the same.
% when they are different, the value is 1 and the uniquebuckets are incremented by 1

uniquebuckets=cumsum(distinctbuckets);

% This results in an index we can use for accumarry. We store the results in total

total=accumarray(uniquebuckets',B(:,1));

% And the bucket value is the unique sorted bucket values

bucket=unique(B(:,2));

return;

This works nicely for numeric values. If you have this all as a single large matrix, you can select out the appropriate columns. For example

[day,totals] = bucket(m(:,1),m(:,3)) % the first column being the day and the third column being the cars.
[hour,totals] = bucket(m(:,2),m(:,3)) % works just as nicely for grouping by the hours.

With less than a dozen lines of actual code, you can implement a quick and efficient SQL Group BY function in Matlab.

(Originally posted at Toomre Capital Management.)

(Categories: )