Collating data from different sources

I have two csv datasets t1 and t2:
t1: t2:
A B C D A B C D
1 5 4.1 "a" 4 3 4.2 "e"
2 3 3.8 "g" 7 5 2.5 "c"
3 2 3.6 "d" 9 4 4.3 "b"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"
So t1 is missing some values but is the 'master' table. t2 has only the rows with missing values from t1, but with the data filled (from another source).
I am trying to fill the missing values in t1 using the values from t2, where variable 'A' is like an index (having unique values for each observation).
I have tried using outerjoin() but the result is:
t3:
tleft_A tleft_B tleft_C tleft_D tright_A tright_B tright_C tright_D
1 5 4.1 "a" NaN NaN NaN NaN
2 3 3.8 "g" NaN NaN NaN NaN
3 2 3.6 "d" NaN NaN NaN NaN
4 NaN 4.2 "e" 4 3 2.5 "e"
5 6 4.2 "h" NaN NaN NaN NaN
6 2 3.6 "d" NaN NaN NaN NaN
7 NaN 2.5 "c" 7 5 2.5 "c"
8 5 1.8 "e" NaN NaN NaN NaN
9 NaN 4.3 "b" 9 4 4.3 "b"
10 6 2.6 "a" NaN NaN NaN NaN
What I am trying to achieve, however, is:
t3:
A B C D
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 3 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
10 6 2.6 "a"
Is there a simple way to accomplish this, or do I need to extract, index and combine the variables, or approach it programmatically via a loop or something similar? None of the interpolation-type fill methods seem to be appropriate either.
I'm fairly new to MatLab, so if there is any other information I am missing which would help clarify the problem, I'll be happy to update the question with it.
Thanks in advance for any insights you can provide.

2 个评论

Please attach the csv files, use the paperclip button to do so.
Thanks for your response Dyuman.
The 'data' in the question is abstracted, the csv files are 1000+ records in length with more fields. It's actually the Titanic dataset from Kaggle. I simply extracted the records with missing data and filled them in with data collected from other expert sources. I was primarily hoping for a generalised strategy, to solve the type of problem the abstracted data exposed, since I anticipate running into the same type of issue with other datasets in the future. I will include the csv files, for your edification, but just figured I must be missing something in terms of strategy.

请先登录,再进行评论。

回答(1 个)

You could use this approach
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
9 4 4.3 "b"};
t1 = cell2table(c1(2:end,:));
t1.Properties.VariableNames = c1(1,:)
t1 = 10×4 table
A B C D __ ___ ___ ___ 1 5 4.1 "a" 2 3 3.8 "g" 3 2 3.6 "d" 4 NaN 4.2 "e" 5 6 4.2 "h" 6 2 3.6 "d" 7 NaN 2.5 "c" 8 5 1.8 "e" 9 NaN 4.3 "b" 10 6 2.6 "a"
t2 = cell2table(c2(2:end,:));
t2.Properties.VariableNames = c2(1,:)
t2 = 3×4 table
A B C D _ _ ___ ___ 4 3 4.2 "e" 7 5 2.5 "c" 9 4 4.3 "b"
% Find the indices of the missing data in t1
idxMissing = find(isnan(t1.B))
idxMissing = 3×1
4 7 9
% Fill the values from t2
t3 = t1;
[~,idx2] = ismember(idxMissing,t2.A)
idx2 = 3×1
1 2 3
t3(idxMissing,:) = t2(idx2,:)
t3 = 10×4 table
A B C D __ _ ___ ___ 1 5 4.1 "a" 2 3 3.8 "g" 3 2 3.6 "d" 4 3 4.2 "e" 5 6 4.2 "h" 6 2 3.6 "d" 7 5 2.5 "c" 8 5 1.8 "e" 9 4 4.3 "b" 10 6 2.6 "a"

7 个评论

Note this will also work if t2 has extra rows that don't correspond to missing data for example
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
15 8 6.2 "g"};
t1 = cell2table(c1(2:end,:));
t1.Properties.VariableNames = c1(1,:)
t1 = 10×4 table
A B C D __ ___ ___ ___ 1 5 4.1 "a" 2 3 3.8 "g" 3 2 3.6 "d" 4 NaN 4.2 "e" 5 6 4.2 "h" 6 2 3.6 "d" 7 NaN 2.5 "c" 8 5 1.8 "e" 9 NaN 4.3 "b" 10 6 2.6 "a"
t2 = cell2table(c2(2:end,:));
t2.Properties.VariableNames = c2(1,:)
t2 = 5×4 table
A B C D __ _ ___ ___ 4 3 4.2 "e" 7 5 2.5 "c" 8 5 1.8 "e" 9 4 4.3 "b" 15 8 6.2 "g"
% Find the indices of the missing data in t1
idxMissing = find(isnan(t1.B))
idxMissing = 3×1
4 7 9
% Fill the values from t2
t3 = t1;
[~,idx2] = ismember(idxMissing,t2.A)
idx2 = 3×1
1 2 4
t3(idxMissing,:) = t2(idx2,:)
t3 = 10×4 table
A B C D __ _ ___ ___ 1 5 4.1 "a" 2 3 3.8 "g" 3 2 3.6 "d" 4 3 4.2 "e" 5 6 4.2 "h" 6 2 3.6 "d" 7 5 2.5 "c" 8 5 1.8 "e" 9 4 4.3 "b" 10 6 2.6 "a"
Thanks for your time Jon, I appreciate the efforts you've gone to with your answer. If I'm understanding correctly, the strategy you suggest would be to extract and index the observations with the missing values from the 'master' table (as numeric arrays) and combine them with the records from the 'donor' table (similarly extracted) then repackage them in the original 'master' table? Or am I missing your point.
I had thought this might be the type of strategy I would have to go with, but it seems very specific to the particular problem, I guess I was just hoping there was a more 'general' approach like the fill functions but with provided data as opposed to interpolation or copying previous values, etc.
Thanks again, I'll look over your strategy to try to understand it better.
The approach I outline is relatively straightforward, but I understand, it would be nice if there were some more or less "built in" approach for doing this. There may be one, maybe someone else will answer and point out a simpler way to do this.
I agree, the approach you outlined is straightforward and fairly intuitive, I hope my comment didn't seem ungrateful, or detract from your effort.
I am just getting started with machine learning and had thought that this would be a fairly common problem encountered, hence thinking I was simply too inexperienced to know something blindingly obvious that everyone else was aware of.
No worries, I totally understand that you are thinking there must be nicer way to do something like this. I couldn't think of one, and wanted to make sure you at least had a way to get it done. I'm curious to know also if someone has a simpler approach for this. Let's see, maybe you'll get some more answers.
I'll give it a little time, and if no one else has a more succinct method, I'll be happy to accept this as the answer. If I come up with something else in my researching, I'll update in a comment here so you can get the notification.
Cheers again for your input Jon
Here is another approach, that utilizes operations with timetables, rather than indexing. Still a little arcane, but maybe you prefer this. You may want to remove a few more semicolons at end of statements to understand better what each step is doing
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
15 8 6.2 "g"};
% Put data into timetables, using A as the time variable
t1 = cell2table(c1(2:end,:),'VariableNames',c1(1,:));
t1.A = days(t1.A); % convert first column to a duration
t1 = table2timetable(t1);
t2 = cell2table(c2(2:end,:),'VariableNames',c2(1,:));
t2.A = days(t2.A); % convert first column to a duration
t2 = table2timetable(t2);
% Combine the tables (makes union of timetables) putting missing data
% wherever data is not found for a given time
t3 = synchronize(t1,t2)
t3 = 11×6 timetable
A B_t1 C_t1 D_t1 B_t2 C_t2 D_t2 _______ ____ ____ _________ ____ ____ _________ 1 day 5 4.1 "a" NaN NaN <missing> 2 days 3 3.8 "g" NaN NaN <missing> 3 days 2 3.6 "d" NaN NaN <missing> 4 days NaN 4.2 "e" 3 4.2 "e" 5 days 6 4.2 "h" NaN NaN <missing> 6 days 2 3.6 "d" NaN NaN <missing> 7 days NaN 2.5 "c" 5 2.5 "c" 8 days 5 1.8 "e" 5 1.8 "e" 9 days NaN 4.3 "b" 4 4.3 "b" 10 days 6 2.6 "a" NaN NaN <missing> 15 days NaN NaN <missing> 8 6.2 "g"
% Replace the missing values using values from t2
% note by defualt, min omits missing, e.g. NaN
t3.B_t1 = min(t3.B_t1,t3.B_t2);
% Remove columns from t2
t3 = t3(:,1:3);
% Remove any rows that were in t2 but not t1
% These will have missing data from the synchronization
t3 = rmmissing(t3);
% Optionally clean up the variable names
t3.Properties.VariableNames = c1(1,2:end)
t3 = 10×3 timetable
A B C D _______ _ ___ ___ 1 day 5 4.1 "a" 2 days 3 3.8 "g" 3 days 2 3.6 "d" 4 days 3 4.2 "e" 5 days 6 4.2 "h" 6 days 2 3.6 "d" 7 days 5 2.5 "c" 8 days 5 1.8 "e" 9 days 4 4.3 "b" 10 days 6 2.6 "a"

请先登录,再进行评论。

类别

帮助中心File Exchange 中查找有关 Logical 的更多信息

提问:

2023-10-4

编辑:

Jon
2023-10-5

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by