Join tables based on times that are within 3 hours of each other
1 次查看(过去 30 天)
显示 更早的评论
I have two tables that should be joined based on matching id values, but I would only like to join those values which have data collected within three hours of each other.
The joined table AB does not ned to be in that specific order or format, it is only an example. It just needs to contain the correct dataA and dataB which match both the id and the condition that the time is within 3 hours of each other.
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB' 'dataB'});
AB = table(["AAA"; "BBB"; "CCC"], ......
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 14:02:11'); datetime('2023-02-12 03:10:11')], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17')], ...
["Red"; "White"; "Green"], ...
[9999; 34 ; 3333],...
'VariableNames', { 'idAB', 'dateA', 'dateB','dataB','dataA' });
A
B
AB
采纳的回答
J. Alex Lee
2023-10-16
编辑:J. Alex Lee
2023-10-16
This may be one such "optimization", though i have no idea if it is faster, but it maybe is less "brute force"?
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
% B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
% [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
% datetime('2023-09-29 05:00:00')], ...
% ["Red"; "White"; "Green"; "Orange"], ...
% 'VariableNames', {'idB', 'dateB' 'dataB'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Black"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
AB = table(["AAA"; "BBB"; "CCC"], ......
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 14:02:11'); datetime('2023-02-12 03:10:11')], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17')], ...
["Red"; "White"; "Green"], ...
[9999; 34 ; 3333],...
'VariableNames', { 'idAB', 'dateA', 'dateB','dataB','dataA' });
%% work with timetables so you can use things like retime
TA = sortrows(table2timetable(A,"RowTimes",A.dateA));
TB = sortrows(table2timetable(B,"RowTimes",B.dateB));
% break up into id's to reduce logic needed for meeting both conditions
idList = intersect(TA.idA,TB.idB);
% pre-allocate for joined tables by ID
TAB = cell(numel(idList),1);
% iterate through id's
for k = 1:numel(idList)
subA = TA(TA.idA==idList(k),:);
subB = TB(TB.idB==idList(k),:);
% retime to the table with fewer entries
% otherwise I think you duplicate entries and have to trim later
if height(subA) < height(subB)
src = subA;
tgt = subB;
else
src = subB;
tgt = subA;
end
% retime the target table to the timestamps of the source table
tgtR = retime(tgt,src.Time,"nearest");
% join the tables using built-in
tmp = innerjoin(src,tgtR);
% discriminate based on difference between times
% this is already the closest match there is
% if there are multiple matches within 3 hrs they will be already discounted
isFar = abs(tmp.dateB-tmp.dateA) > duration(3,0,0);
% remove entries without matches within 3 hours
tmp(isFar,:) = [];
% save the merged table (for a single id)
TAB{k} = tmp;
end
AB_2 = timetable2table(vertcat(TAB{:}))
3 个评论
dpb
2023-10-16
编辑:dpb
2023-10-17
Probably "not so much" if either A or B isn't known a priori to be the dominant one to match against, but if it doesn't matter, then picking either A or B for the list of IDs is ok; since there has to be one in each to match, if you get the odd one(s) in the list you parse or not doesn't matter; the list of which are in both has to be complete regardless of which list you use.
I didn't read his solution closely enough to follow the need for retime, anyway; since the time difference between the two is the selection criterion, it doesn't matter which is the reference, it would simply change the sign of the difference depending on which is compared to which and it was not specified the time difference was signed.
One (I assumed probably minor) optimization in mine would be to get the unique list that isn't in both and iterate over it instead of just picking the one list and working down its members one-by-one as I did. I started on that path and then went the straight list route as I read the initial follow on comment of there being only the one unique time as meaning there was only one--hence the issue with duplicates had to fix later...
J. Alex Lee
2023-10-16
Retime is my way of finding nearest timestamp, mistakenly thought it would be faster than manually searching for each row...oops ;)
更多回答(1 个)
dpb
2023-10-16
编辑:dpb
2023-10-16
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Black"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
A,B
AB=[];
for i=1:height(A)
ixU=(B.idB==A.idA(i));
if any(ixU)
ixT=false(size(ixU));
for j=1:numel(ixU)
if ~ixU(j), continue, end
%[i j], [A.dateA(i),B.dateB(j)]
ixT(j)=(abs(A.dateA(i)-B.dateB(j))<=hours(3));
ixB=ixU&ixT;
if any(ixB)
AB=[AB;table(A.idA(i),A.dateA(i),B.dateB(ixB),A.dateA(i)-B.dateB(ixB),A.dataA(i),B.dataB(ixB))];
end
end
end
end
AB.Properties.VariableNames={'idAB','DateA','DateB','DateDiff','DataA','DataB'};
AB=unique(AB,'rows');
AB
Above is brute force; depends upon how large the real tables are as to how much optimization might be called for. If numel(ixB) can be > 1 in the wild, then the subsequent assignment needs must be in another loop, of course.
17 个评论
dpb
2023-10-18
I guess it is, it is a matching of element within an array which is what ismember does. Have to think about any alternative syntax that could possibly be more efficient.
The conversion to categorical ought to be done initially to the overall tables; that shouldn't take long
>> B=repmat(B,round(10000/height(B)),1);
>> whos B
Name Size Bytes Class Attributes
B 10000x3 1161609 table
>> tic;B=convertvars(B,{'idB','dataB'},'categorical');toc
Elapsed time is 0.015839 seconds.
>>
Remember then to change the variable type in the empty table to match so there's not a conversion to/from during the working loop. From
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
'VariableTypes',{'string','datetime','datetime','duration','double','string'}, ...
'VariableNames',VariableNames);
to
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
'VariableTypes',{'categorical','datetime','datetime','duration','double','categorical'}, ...
'VariableNames',VariableNames);
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Dates and Time 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!