Performing an SQL Style “Group By” in Matlab
On a recent project, I found that I needed to group data in one matrix by values in another matrix, but I could not find a nice way of doing this. So, I wrote a simple Matlab function to do this sort of grouping. I am describing it here in case others run into a similar need.
Let me start off by describing some sample data. Assume we have been collecting data on the number of cars that pass a checkpoint by hour for several days. We might end up with data that looks something like this:
Day | hour | Cars |
1 | 8 | 47 |
1 | 9 | 38 |
1 | 10 | 35 |
2 | 8 | 52 |
2 | 9 | 42 |
2 | 10 | 37 |
3 | 8 | 43 |
3 | 9 | 34 |
3 | 10 | 45 |
In SQL, we might do something like:
SELECT day,sum(cars) from data group by day or
SELECT hour,sum(cars) from data group by hour
What I did was create a function, I call ‘bucket’ for gathering the data into buckets.
function [bucket,total] = bucket(buckets, values)
% Create an Array of values and buckets
A = cat(2,values, buckets);% Sort the array on the buckets
B = sortrows(A,2);
sortedbuckets=B(:,2);% Get the number of elements in the array
n=size(sortedbuckets,1);% Set distinctbuckets to be different buckets. Initially, all the same
distinctbuckets(1:n) = 1;
% Then, compare the elements of the buckets with the values immediately following
distinctbuckets(2:n) = ( sortedbuckets(1:n-1) ~= sortedbuckets(2:n) );
% cumsum does a cumulative sum. As long as the bucket values are the same, the value
% in distinctbuckets will be zero and the resulting uniquebucket stays the same.
% when they are different, the value is 1 and the uniquebuckets are incremented by 1uniquebuckets=cumsum(distinctbuckets);
% This results in an index we can use for accumarry. We store the results in total
total=accumarray(uniquebuckets',B(:,1));
% And the bucket value is the unique sorted bucket values
bucket=unique(B(:,2));
return;
This works nicely for numeric values. If you have this all as a single large matrix, you can select out the appropriate columns. For example
[day,totals] = bucket(m(:,1),m(:,3)) % the first column being the day and the third column being the cars.
[hour,totals] = bucket(m(:,2),m(:,3)) % works just as nicely for grouping by the hours.
With less than a dozen lines of actual code, you can implement a quick and efficient SQL Group BY function in Matlab.
(Originally posted at Toomre Capital Management.)