Regexp to extract all characters in a varied string up to match.

Hello userbase,
I'm new to regexes. I'm working with some transistor test data and trying to extract information from .csv file names for sorting prior to further probing.
They often have a format such as this:
target = Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv
target = Some Other Test [123456_LS (further info including dates and temperatures)].csv
I want to extract the entire string up to the HS variant, including the optional number that follows it, as this represents the device and test. The further info relates to parameters.
The Some Test Performed section can be single or multiple words, contain special characters (&-_).
I'm looking for HS, LS, HS1, HS2, HS3, LS1, LS2, LS3.
I've tried lookbehind assertions, but it feels kludgy and I've guessed a bit:
pattern = '(?<=((HS)|(HS)\d|(LS)|(LS)\d))\s'
How can I improve this?
What does the ? normally do? (I see that here is a special case for the lookaround.)
My desired regexp(target, pattern, 'match') output would be:
match = Some Test Performed [12345678987_HS1
match = Some Other Test [123456_LS
Or at least the index of the final character so I could use target(1:match) to extract my string. Is there some useful 'from start of target until match' metacharacter?
Best regards and thanks for reading, Marshall

Accepted Answer

Geoff Hayes 2014-11-12
Marshall - if all of your target strings (the csv filenames) have an open bracket ( in them, and you want all the characters before it, then you could call strfind to get the index of the open bracket and copy all characters up to that index. Something like
target = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
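idx = strfind(target, '(');   % indices of every '(' in target
if ~isempty(idx)
    % keep everything before the first '(' and drop the trailing space
    match = strtrim(target(1:idx(1)-1));
end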
which would return
match =
Some Test Performed [12345678987_HS1
However, if the open bracket rule is not valid for all cases, then you could try simplifying your pattern to
pattern = '.+[HL]S[\d\s]';
where
.+ means match one or more of any character, including whitespace (the plus sign means one or more), and it is greedy, so it takes as many characters as it can;
[HL]S means a single character match on either H or L followed by an S; and
[\d\s] means match on either a single numeric character or any whitespace character.
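To see each piece of the pattern at work, here is a minimal sketch on a few short strings. The strings are made up purely for illustration, and the 'once' option is used so that regexp returns a character vector (empty when nothing matches) rather than a cell array.

```matlab
pattern = '.+[HL]S[\d\s]';

% Hypothetical test strings, chosen only to exercise each part of the pattern.
a = regexp('x_HS1 rest', pattern, 'match', 'once');  % digit follows HS
b = regexp('x_LS rest',  pattern, 'match', 'once');  % whitespace follows LS
c = regexp('x_HS',       pattern, 'match', 'once');  % nothing follows HS
d = regexp('HS1',        pattern, 'match', 'once');  % nothing precedes HS

disp({a, b, c, d})
```

Here a and b contain the text up to and including the character after HS/LS, while c and d come back empty: [\d\s] needs a character after the S, and .+ needs at least one character before the H or L.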
So with your two target strings above, using this pattern we would see
target1 = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
target2 = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';
pattern = '.+[HL]S[\d\s]';
match1 = regexp(target1,pattern,'match');
match2 = regexp(target2,pattern,'match');
with
match1 =
'Some Test Performed [12345678987_HS1'
match2 =
'Some Other Test [123456_LS '
A problem with the above pattern may occur when there are additional HS or LS characters that follow the first pattern match. For example, if your target is
target3 = 'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv';
match3 = regexp(target3,pattern,'match')
then the match is found to be
match3 =
'Some HS Test Performed [12345678987_HS1 (further info including dates and HS '
So you may want to narrow the pattern so that a numeric string followed by an underscore must come immediately before the HS or LS
newPattern = '.+\d+_[HL]S[\d\s]';
match3 = regexp(target3,newPattern,'match')
which returns the desired
match3 =
'Some HS Test Performed [12345678987_HS1'
This new pattern will work for the other two targets as well.
Note that for the second match, we have a trailing whitespace character. You may want to wrap your regexp with a strtrim to remove it.
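As a rough sketch of that clean-up, the three targets can be passed to regexp as a cell array and the result trimmed in one go. The second variant, altPattern, is not from the pattern above; it is an assumption that swaps [\d\s] for \d?, where the ? makes the preceding digit optional (zero or one occurrence), so the whitespace never becomes part of the match.

```matlab
targets = { ...
    'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv', ...
    'Some Other Test [123456_LS (further info including dates and temperatures)].csv', ...
    'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv'};

% Option 1: keep the pattern as given and trim the trailing whitespace.
newPattern = '.+\d+_[HL]S[\d\s]';
trimmed = strtrim(regexp(targets, newPattern, 'match', 'once'));

% Option 2 (assumption, see above): optional digit instead of digit-or-space.
altPattern = '.+\d+_[HL]S\d?';
alt = regexp(targets, altPattern, 'match', 'once');

disp(isequal(trimmed, alt))
```

For these three targets both options give the same strings. Note that altPattern is slightly looser: it no longer requires a digit or whitespace after HS/LS, so it would also accept something like _HSx.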
2 Comments
Marshall 2014-11-13
Hi, that's a great and thorough answer. Thanks for taking the time to explain the metacharacters too and to guess that the bracket after HS/LS isn't the standard case (it isn't)
And if I leave out the 'match' option, regexp returns [1] because the match begins at the start of the string?
strtrim is a good suggestion too. Thanks again :)
Geoff Hayes 2014-11-13
Glad to be able to help, Marshall. And yes, the [1] is returned when you remove the 'match' option because [1] is the start index of the pattern.
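For the index-based extraction mentioned in the question, a minimal sketch: without any output option, regexp returns the start index first and the end index second, so asking for two outputs gives the position of the final matched character as well. The variable names startIdx and endIdx are just illustrative.

```matlab
target  = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';
pattern = '.+\d+_[HL]S[\d\s]';

% 'once' returns scalar indices for the first match instead of vectors.
[startIdx, endIdx] = regexp(target, pattern, 'once');

% Index into the original string, then drop the trailing whitespace.
match = strtrim(target(startIdx:endIdx));
disp(match)
```

Here startIdx is 1, as discussed above, and endIdx points at the whitespace after LS, which is why the strtrim is still needed.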
