How can I remove inverted repeat pairs of strings from a table?

3 次查看(过去 30 天)
Hi I wanna extract the inverted repeat pairs of strings from a 650x2 table. Let say I have the following pairs in a table:
A123.B123 B123.C123
A456.B456 B456.C456
A789.B789 B789.C789
B123.C123 A123.B123
B456.C456 A456.B456
. .
. .
. .
So as you can see there are some pairs that if we invert the order of pairing they became the same pair, for example the first pair with the fourth pair so I wanna extract those inverted repeated pairs from my table but I dont know how to do it. I tried with the "unique" function but that doesnt seems to work for inverted repeats. Any suggestions?
  3 个评论
Dyuman Joshi
Dyuman Joshi 2024-5-7
编辑:Dyuman Joshi 2024-5-7
@Paul Jimenez, There are no inverted string pairs in the data you have -
readtable('table.csv')
Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property.
Set 'VariableNamingRule' to 'preserve' to use the original column headers as table variable names.
ans = 656x2 table
x10N_286_54_E7 x10N_286_54_B1 _________________ __________________ {'10N.286.55.E1'} {'10N.286.54.B1' } {'10N.286.46.E8'} {'10N.261.52.F7' } {'10N.286.54.A6'} {'10N.286.54.B1' } {'10N.286.54.A6'} {'10N.286.54.A11'} {'10N.286.54.F7'} {'10N.286.54.A11'} {'10N.286.54.F7'} {'10N.286.54.A1' } {'10N.286.54.A6'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.A6' } {'10N.286.55.E1'} {'10N.286.54.E1' } {'12B09' } {'12B09' } {'10N.286.54.E7'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.E1' } {'10N.286.54.E1'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.A1' } {'10N.286.55.E8'} {'10N.286.55.B8' } {'10N.286.54.A6'} {'10N.286.54.E1' }
Paul Jimenez
Paul Jimenez 2024-5-7
Yes there are. If you look for example, the first pair under the headers:
10N.286.55.E1 10N.286.54.B1
and then look at the 104 pair of the table you will see that is :
10N.286.54.B1 10N.286.55.E1
which essentially is the same pair but in oposite order.

请先登录,再进行评论。

采纳的回答

Voss
Voss 2024-5-7
编辑:Voss 2024-5-7
T = readtable('table.csv')
Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property.
Set 'VariableNamingRule' to 'preserve' to use the original column headers as table variable names.
T = 656x2 table
x10N_286_54_E7 x10N_286_54_B1 _________________ __________________ {'10N.286.55.E1'} {'10N.286.54.B1' } {'10N.286.46.E8'} {'10N.261.52.F7' } {'10N.286.54.A6'} {'10N.286.54.B1' } {'10N.286.54.A6'} {'10N.286.54.A11'} {'10N.286.54.F7'} {'10N.286.54.A11'} {'10N.286.54.F7'} {'10N.286.54.A1' } {'10N.286.54.A6'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.A6' } {'10N.286.55.E1'} {'10N.286.54.E1' } {'12B09' } {'12B09' } {'10N.286.54.E7'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.E1' } {'10N.286.54.E1'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.A1' } {'10N.286.55.E8'} {'10N.286.55.B8' } {'10N.286.54.A6'} {'10N.286.54.E1' }
Here's one way to find pairs of reversed rows:
temp = string(T.(1)) == string(T.(2)).';
[r2,r1] = find(temp & temp.');
r = [r1 r2];
disp(r)
1 104 2 33 3 69 4 51 5 200 6 160 7 18 8 194 9 13 10 10 11 70 12 188 13 9 14 293 15 24 16 101 17 90 18 7 19 93 20 96 21 268 22 26 23 73 24 15 25 205 26 22 27 58 28 95 29 357 30 46 31 75 32 203 33 2 34 52 35 182 36 183 37 86 38 59 39 249 40 40 41 67 43 397 44 55 45 328 46 30 47 110 48 77 49 383 50 173 51 4 52 34 53 112 55 44 56 74 57 89 58 27 59 38 61 215 62 217 63 274 64 108 65 288 66 255 67 41 68 210 69 3 70 11 71 97 72 227 73 23 74 56 75 31 77 48 78 514 79 126 80 214 81 218 82 338 83 229 84 135 85 251 86 37 87 129 88 498 89 57 90 17 91 161 92 287 93 19 94 339 95 28 96 20 97 71 98 149 99 325 100 286 101 16 102 198 103 169 104 1 105 223 106 261 107 224 108 64 109 367 110 47 111 176 112 53 113 349 114 337 117 521 118 262 119 192 120 458 122 326 123 324 124 133 126 79 127 424 129 87 130 143 131 187 132 520 133 124 134 189 135 84 136 301 137 477 138 645 139 197 140 213 141 323 142 206 143 130 144 622 145 319 146 159 148 294 149 98 150 373 152 400 153 416 154 170 155 453 156 226 157 211 158 230 159 146 160 6 161 91 163 247 164 413 165 241 166 422 167 172 168 443 169 103 170 154 171 327 172 167 173 50 174 395 175 425 176 111 177 382 178 244 179 282 180 291 181 406 182 35 183 36 184 201 185 515 186 446 187 131 188 12 189 134 190 350 191 238 192 119 193 464 194 8 195 292 196 317 197 139 198 102 199 343 200 5 201 184 202 551 203 32 204 480 205 25 206 142 207 269 208 442 209 303 210 68 211 157 212 427 213 140 214 80 215 61 216 232 217 62 218 81 220 478 221 316 222 635 223 105 224 107 225 633 226 156 227 72 228 258 229 83 230 158 232 216 233 651 234 372 235 356 236 625 238 191 239 512 240 330 241 165 242 479 243 617 244 178 245 612 246 297 247 163 249 39 251 85 252 474 253 531 254 332 255 66 256 396 257 279 258 228 259 368 260 452 261 106 262 118 263 362 264 361 265 418 266 311 267 604 268 21 269 207 270 655 271 459 272 434 273 278 274 63 275 467 276 387 277 567 278 273 279 257 280 626 281 321 282 179 284 619 285 348 286 100 287 92 288 65 289 475 290 391 291 180 292 195 293 14 294 148 295 529 296 630 297 246 298 524 299 355 301 136 302 460 303 209 304 305 305 304 307 492 308 408 309 344 310 389 311 266 312 439 313 313 315 496 316 221 317 196 318 599 319 145 320 359 321 281 322 412 323 141 324 123 325 99 326 122 327 171 328 45 329 641 330 240 331 358 332 254 333 378 334 380 335 463 336 352 337 114 338 82 339 94 340 433 341 377 342 481 343 199 344 309 345 642 346 398 347 385 348 285 349 113 350 190 351 388 352 336 353 636 354 436 355 299 356 235 357 29 358 331 359 320 361 264 362 263 363 365 364 366 365 363 366 364 367 109 368 259 369 469 370 465 371 384 372 234 373 150 374 450 375 484 376 564 377 341 378 333 379 447 380 334 382 177 383 49 384 371 385 347 386 401 387 276 388 351 389 310 390 415 391 290 392 455 393 449 394 500 395 174 396 256 397 43 398 346 399 431 400 152 401 386 402 503 403 405 404 507 405 403 406 181 407 423 408 308 409 430 410 583 412 322 413 164 414 550 415 390 416 153 418 265 419 646 420 429 421 624 422 166 423 407 424 127 425 175 427 212 429 420 430 409 431 399 432 456 433 340 434 272 435 623 436 354 437 506 438 497 439 312 440 489 441 510 442 208 443 168 445 457 446 186 447 379 449 393 450 374 451 482 452 260 453 155 454 461 455 392 456 432 457 445 458 120 459 271 460 302 461 454 462 509 463 335 464 193 465 370 467 275 468 516 469 369 471 589 472 544 473 530 474 252 475 289 476 493 477 137 478 220 479 242 480 204 481 342 482 451 484 375 485 650 486 569 487 501 488 555 489 440 490 494 491 527 492 307 493 476 494 490 495 519 496 315 497 438 498 88 499 611 500 394 501 487 502 594 503 402 504 598 505 517 506 437 507 404 508 597 509 462 510 441 511 537 512 239 513 538 514 78 515 185 516 468 517 505 518 588 519 495 520 132 521 117 522 584 523 540 524 298 525 601 526 616 527 491 529 295 530 473 531 253 532 533 533 532 534 561 535 602 536 618 537 511 538 513 539 563 540 523 541 600 542 592 543 553 544 472 545 591 547 579 548 595 549 593 550 414 551 202 552 556 553 543 554 605 555 488 556 552 557 565 558 560 559 578 560 558 561 534 562 587 563 539 564 376 565 557 566 608 567 277 568 656 569 486 570 571 571 570 572 577 573 610 575 609 576 606 577 572 578 559 579 547 580 614 582 603 583 410 584 522 585 634 587 562 588 518 589 471 590 596 591 545 592 542 593 549 594 502 595 548 596 590 597 508 598 504 599 318 600 541 601 525 602 535 603 582 604 267 605 554 606 576 607 637 608 566 609 575 610 573 611 499 612 245 614 580 615 632 616 526 617 243 618 536 619 284 622 144 623 435 624 421 625 236 626 280 628 629 629 628 630 296 631 640 632 615 633 225 634 585 635 222 636 353 637 607 639 643 640 631 641 329 642 345 643 639 645 138 646 419 648 653 650 485 651 233 653 648 655 270 656 568
That says row 1 is a reversed copy of row 104, row 2 is a reversed copy of row 33, and so on.
Checking the first few, they do seem to be reversed pairs of rows:
T{r(1,:),:} % rows 1 and 104
ans = 2x2 cell array
{'10N.286.55.E1'} {'10N.286.54.B1'} {'10N.286.54.B1'} {'10N.286.55.E1'}
T{r(2,:),:} % rows 2 and 33
ans = 2x2 cell array
{'10N.286.46.E8'} {'10N.261.52.F7'} {'10N.261.52.F7'} {'10N.286.46.E8'}
T{r(3,:),:} % rows 3 and 69
ans = 2x2 cell array
{'10N.286.54.A6'} {'10N.286.54.B1'} {'10N.286.54.B1'} {'10N.286.54.A6'}
I'm not sure exactly what you want to do with this information.
  1 个评论
Voss
Voss 2024-5-7
Here's a slight modification that's useful for removing one of each pair of reversed rows from the table:
T = readtable('table.csv')
Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property.
Set 'VariableNamingRule' to 'preserve' to use the original column headers as table variable names.
T = 656x2 table
x10N_286_54_E7 x10N_286_54_B1 _________________ __________________ {'10N.286.55.E1'} {'10N.286.54.B1' } {'10N.286.46.E8'} {'10N.261.52.F7' } {'10N.286.54.A6'} {'10N.286.54.B1' } {'10N.286.54.A6'} {'10N.286.54.A11'} {'10N.286.54.F7'} {'10N.286.54.A11'} {'10N.286.54.F7'} {'10N.286.54.A1' } {'10N.286.54.A6'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.A6' } {'10N.286.55.E1'} {'10N.286.54.E1' } {'12B09' } {'12B09' } {'10N.286.54.E7'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.E1' } {'10N.286.54.E1'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.A1' } {'10N.286.55.E8'} {'10N.286.55.B8' } {'10N.286.54.A6'} {'10N.286.54.E1' }
temp = string(T.(1)) == string(T.(2)).';
idx = tril(temp & temp.');
idx(1:size(T,1)+1:end) = false; % to avoid removing a row that is the reverse of itself,
% set elements of idx along the diagonal to false
[r,~] = find(idx);
T(r,:) = []
T = 353x2 table
x10N_286_54_E7 x10N_286_54_B1 _________________ __________________ {'10N.286.55.E1'} {'10N.286.54.B1' } {'10N.286.46.E8'} {'10N.261.52.F7' } {'10N.286.54.A6'} {'10N.286.54.B1' } {'10N.286.54.A6'} {'10N.286.54.A11'} {'10N.286.54.F7'} {'10N.286.54.A11'} {'10N.286.54.F7'} {'10N.286.54.A1' } {'10N.286.54.A6'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.A6' } {'10N.286.55.E1'} {'10N.286.54.E1' } {'12B09' } {'12B09' } {'10N.286.54.E7'} {'10N.286.55.E1' } {'10N.286.55.A1'} {'10N.286.54.E1' } {'10N.286.55.A1'} {'10N.286.54.A1' } {'10N.286.55.E8'} {'10N.286.55.B8' } {'10N.286.54.A6'} {'10N.286.54.E1' } {'10N.222.49.C7'} {'10N.222.46.C2' }
Checking again for reversed pairs of rows confirms that the only ones left are the reverse of themselves:
temp = string(T.(1)) == string(T.(2)).';
[r,~] = find(tril(temp & temp.'))
r = 3x1
10 35 213
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
T(r,:)
ans = 3x2 table
x10N_286_54_E7 x10N_286_54_B1 ______________ ______________ {'12B09'} {'12B09'} {'12B01'} {'12B01'} {'12G01'} {'12G01'}

请先登录,再进行评论。

更多回答(1 个)

Mathieu NOE
Mathieu NOE 2024-5-6
hello Paul
this would be my suggestion
attached your data simply pasted in a text file
hope it helps
clc
out = readcell('data.txt');
first_col = out(:,1);
second_col = out(:,2);
% main loop
n = 0;
for k = 1:numel(first_col)
tf = strcmp(first_col{k},second_col);
if any(tf)
n = n + 1; % increase counter
ind1(n,1) = k;
ind2(n,1) = find(tf);
end
end
% all matching pairs
out = [ind1 ind2]
out = 4x2
1 4 2 5 4 1 5 2
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
  2 个评论
Paul Jimenez
Paul Jimenez 2024-5-6
Thanks a lot for your answer. Unfortunately I got the following error:
Unable to perform assignment because the indices on the left side are not compatible with the size of the right side.
Error in strains (line 19)
ind2(n,1) = find(tf);
Also, is there a way to remove only one of the pairs from the out list?
Mathieu NOE
Mathieu NOE 2024-5-7
hello again
seems that in the csv file , in each column you have duplicates of strings
so I simply asked to perform the same process but taking only the unique strings in consideration , but of course it's not the same list as your original file
it is what you wanted or not ?
data = readcell('table.csv');
first_col = unique(data(:,1));
second_col = unique(data(:,2));
% main loop
n = 0;
for k = 1:numel(first_col)
tf = strcmp(first_col{k},second_col);
if any(tf)
n = n + 1; % increase counter
ind1(n,1) = k;
ind2(n,1) = find(tf);
end
end
% all matching pairs
out = [ind1 ind2]
out = 97x2
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>

请先登录,再进行评论。

类别

Help CenterFile Exchange 中查找有关 Preprocessing Data 的更多信息

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by