Regexp to extract all characters in a varied string up to match.

Question

Marshall 2014-11-12

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/162373-regexp-to-extract-all-characters-in-a-varied-string-up-to-match

Commented: Geoff Hayes 2014-11-13

Hello userbase,

I'm new to regexes. I'm working with some transistor test data and trying to extract information from .csv file names for sorting prior to further probing.

They often have a format such as this:

target = Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv
target = Some Other Test [123456_LS (further info including dates and temperatures)].csv

I want to extract the entire string up to and including the HS/LS variant and the optional number that follows it, as this represents the device and test. The further info relates to parameters.

The "Some Test Performed" section can be one or more words and may contain special characters (&-_).

I'm looking for HS, LS, HS1, HS2, HS3, LS1, LS2, LS3.

I've tried lookbehind assertions, but it feels kludgy and I've guessed a bit:

pattern = '(?<=((HS)|(HS)\d|(LS)|(LS)\d))\s'

How can I improve this?

What does the ? normally do? (I see that it is a special case here, for the lookaround.)

My desired regexp(target, pattern, 'match') output would be:

match = Some Test Performed [12345678987_HS1
match = Some Other Test [123456_LS

Or at least the index of the final character so I could use target(1:match) to extract my string. Is there some useful 'from start of target until match' metacharacter?

Best regards and thanks for reading, Marshall

Answer 1

Geoff Hayes 2014-11-12

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/162373-regexp-to-extract-all-characters-in-a-varied-string-up-to-match#answer_158655

Marshall - if all of your target strings (the csv filenames) have an opening parenthesis ( in them, and you want all the characters before it, then you could use a strfind call to get the index of the first opening parenthesis, and then copy all characters up to that index. Something like

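 target = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
 idx    = strfind(target, '(');   % indices of every '(' in target
 if ~isempty(idx)
     % keep everything before the first '(' and drop the trailing space
     match = strtrim(target(1:idx(1)-1));
 end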

which would return

 match =
    Some Test Performed [12345678987_HS1

However, if the opening-parenthesis rule does not hold for all cases, then you could try simplifying your pattern to

pattern = '.+[HL]S[\d\s]';

where

.+ means match one or more characters of any kind, including whitespace (the dot matches any single character and the plus sign means one or more);

[HL]S means a single character match on either H or L followed by an S; and

[\d\s] means match on either a single numeric character or any whitespace character.

So with your two target strings above, using this pattern we would see

 target1 = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
 target2 = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';
 pattern = '.+[HL]S[\d\s]';
 match1 = regexp(target1,pattern,'match');
 match2 = regexp(target2,pattern,'match');

with

 match1 = 
    'Some Test Performed [12345678987_HS1'
 match2 = 
    'Some Other Test [123456_LS '

A problem with the above pattern may occur when additional HS or LS substrings follow the first pattern match. For example, if your target is

 target3 = 'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv';
 match3 = regexp(target3,pattern,'match')

then the match is found to be

 match3 = 
    'Some HS Test Performed [12345678987_HS1 (further info including dates and HS '
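
This happens because .+ is greedy: it takes as many characters as it can and backtracks only until the rest of the pattern fits, so the last qualifying HS or LS in the string wins. A minimal sketch, assuming the same target3 as above, compares it with the lazy form .+? (the names greedyMatch and lazyMatch are only illustrative). Neither version stops at the device ID, since the lazy one ends at the first "HS " instead:

```matlab
target3 = 'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv';

% Greedy: .+ grabs as much as possible, so the match ends at the LAST "HS ".
greedyMatch = regexp(target3, '.+[HL]S[\d\s]', 'match', 'once');

% Lazy: .+? grabs as little as possible, so the match ends at the FIRST "HS ".
lazyMatch = regexp(target3, '.+?[HL]S[\d\s]', 'match', 'once');

disp(greedyMatch)
disp(lazyMatch)
```

The 'once' option is used here only so that regexp returns a plain string rather than a cell array.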

So you may want to narrow the pattern so that a numeric string followed by an underscore must precede the HS/LS part

 newPattern = '.+\d+_[HL]S[\d\s]'; 
 match3     = regexp(target3,newPattern,'match')

which returns the desired

 match3 = 
    'Some HS Test Performed [12345678987_HS1'

This new pattern will work for the other two targets as well.

Note that for the second match, we have a trailing whitespace character. You may want to wrap your regexp with a strtrim to remove it.
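
As a sketch of two ways to deal with that trailing space, assume the three example targets are collected in a cell array (the names targets, trimmed, optPattern and noTrailing are only illustrative). Option 1 keeps the pattern from above and trims the result. Option 2 is an assumed variation, not the pattern derived above: it replaces [\d\s] with \d?, where ? means "zero or one of the preceding element", so the digit becomes optional and no whitespace is matched. Note that option 2 no longer requires a digit or space after HS/LS, which may or may not suit the real file names.

```matlab
targets = { ...
    'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv', ...
    'Some Other Test [123456_LS (further info including dates and temperatures)].csv', ...
    'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv'};

% Option 1: keep the pattern from above and strip the trailing whitespace.
% With 'once', regexp returns one string per input cell, which strtrim accepts.
newPattern = '.+\d+_[HL]S[\d\s]';
trimmed    = strtrim(regexp(targets, newPattern, 'match', 'once'));

% Option 2 (assumed variation): \d? makes the digit optional, so no space is matched.
optPattern = '.+\d+_[HL]S\d?';
noTrailing = regexp(targets, optPattern, 'match', 'once');

% Compare the two results element by element.
disp(isequal(trimmed, noTrailing))
```

For these three inputs both options give the same strings; on real file names with other digit-underscore-HS/LS sequences later in the name they could differ, so it is worth checking against a sample of actual files.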

Marshall 2014-11-13

Hi, that's a great and thorough answer. Thanks for taking the time to explain the metacharacters too, and for guessing that the bracket after HS/LS isn't the standard case (it isn't).

And if I exclude the 'match' operator, the reason regexp returns [1] is because the start of that pattern begins at the start of the string?

strtrim is a good suggestion too. Thanks again :)

Geoff Hayes 2014-11-13

Glad to be able to help, Marshall. And yes, the [1] is returned when you remove the 'match' option because [1] is the start index of the pattern.
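
For the index-based extraction mentioned in the question, a minimal sketch: called without an output option, regexp returns the start index of each match as its first output and the end index as its second. The names startIdx, endIdx and device are only illustrative, and the pattern is the narrowed one from the answer.

```matlab
target  = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';
pattern = '.+\d+_[HL]S[\d\s]';

% First output: start index of each match; second output: end index.
[startIdx, endIdx] = regexp(target, pattern);

% Guard against "no match", then slice with parentheses (target is a char array).
if ~isempty(endIdx)
    device = strtrim(target(startIdx(1):endIdx(1)));
else
    device = '';
end
disp(device)
```

Because the pattern begins with .+, a match can only start at the first character, which is why the start index comes back as 1.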
