Tricky Regexp Problem, How to specify a slightly varying pattern

Question

0 votes

Hello all,

I am working with some files that have weird/dumb naming conventions, and I need to extract some info from each file's name for some other stuff I'm doing for a school database project.

The name of a typical file looks something like this:

G001E94A13g1_NM_MAE1317.xls

Where the combination of letters and numbers allows users to know at a glance the student's name, major, and department, among other things.

I want to be able to extract only G001E94A13 from the file name as the g1 is typically associated with something else that isn't necessary to know in the project's scope.

I am hoping to apply this regexp to an entire catalog of these files where the naming convention varies slightly. For example, some students will have two last names and other info that changes this "at a glance" file name. For reference it could look something like:

JG002C96E15X_TX_EE1317.xls

And all I would want is JG002C96E15. The big change is the extra last name that gives the initials JG and the big X after 15. The commonality is that the naming convention doesn't change drastically: I can always say that the first portion of the file name, the part I want, ends with the last two digits of the year the student started the program (15 in this one and 13 in the last one).

I am hoping there is a way to tell the regexp to grab the letters and numbers starting at the beginning of the file name, stopping at the first _, while not grabbing _ or the first set of numbers/characters to the left of _ but keeping the next number that comes after the first character left of the _.

Sounds convoluted but hopefully this explains what I'm trying to do. I'm thinking it might have to be two regexps or something.

What I have tried is:

fileName = {'G001E94A13g1_NM_MAE1317.xls'};
splitStr = regexp(fileName{1}, '_', 'split');
a = regexp(splitStr{1}, '^[A-Za-z0-9]+', 'match', 'once')

And I get:

a = G001E94A13g1

But I want:

a = G001E94A13

Thank you in advance and please let me know if more clarification is needed.

3 comments

Walter Roberson 2018-7-9

Is it one or more uppercase alphabetic letters, followed by uppercase hex, ending at the first non-hex character, with you wanting to stop just before that non-hex character?

L 2018-7-9

@Walter I believe the answer to your question is yes.

@OCDER [1-2 letters][numbers][letter][numbers][letter][numbers][IGNORE 1-3 char]_[IGNORE]

The only variation would be that people sometimes forgot to keep things all caps, so something like:

SJ003h97I16jg1_NM_CIE1317.xls ==> SJ003h97I16

To be honest, from what I've seen sifting through the files, this seems to be what it is like. If I see anything other than what has been described, then I'll post it. I appreciate the help from both of you!
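For illustration, the structure described in this comment could be written as a pattern roughly like the one below. This is only a sketch: the variable names `names`, `pattern` and `prefixes` are made up for the example, and it assumes each run of "numbers" is one or more digits and that the `'ignorecase'` option is an acceptable way to handle the names that were not kept in all caps.

```matlab
% Names quoted in this thread.
names = {'SJ003h97I16jg1_NM_CIE1317.xls', ...
         'G001E94A13g1_NM_MAE1317.xls', ...
         'JG002C96E15X_TX_EE1317.xls'};

% [1-2 letters][numbers][letter][numbers][letter][numbers]
% Everything after the third run of digits is simply left unmatched.
pattern = '^[A-Z]{1,2}\d+[A-Z]\d+[A-Z]\d+';

% 'ignorecase' lets [A-Z] also match lowercase letters such as the "h".
% With a cell array input and 'once', the result is a cell array of
% character vectors, one per name.
prefixes = regexp(names, pattern, 'match', 'once', 'ignorecase');
disp(prefixes)
```

For the three names above this should give SJ003h97I16, G001E94A13 and JG002C96E15.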

Answer 1

Guillaume 2018-7-9

Edited: Guillaume 2018-7-9

2 votes

If I understood correctly, the pattern you want to match is:

one or more uppercase characters at the start: ^[A-Z]+
followed by three digits: \d{3}
followed by a single uppercase character: [A-Z]
followed by two digits: \d{2}
followed by another single uppercase character: [A-Z]
followed by another two digits: \d{2}

So:

match = regexp(filename, '^[A-Z]+\d{3}[A-Z]\d{2}[A-Z]\d{2}', 'match', 'once')
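As a quick check, the same expression can be run over the two file names from the question. The variable names `filenames`, `pattern` and `matches` are just for this sketch.

```matlab
% The two example names from the question.
filenames = {'G001E94A13g1_NM_MAE1317.xls', 'JG002C96E15X_TX_EE1317.xls'};

% Letters, 3 digits, letter, 2 digits, letter, 2 digits, anchored at the start.
pattern = '^[A-Z]+\d{3}[A-Z]\d{2}[A-Z]\d{2}';

% regexp accepts a cell array of names; with 'once' it returns a cell
% array holding one matched character vector per name.
matches = regexp(filenames, pattern, 'match', 'once');
disp(matches)
```

This should return G001E94A13 and JG002C96E15. Because the character classes are uppercase only, a name with a stray lowercase letter inside the wanted part would come back empty unless the classes are widened or the `'ignorecase'` option is added.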

9 comments

L 2018-7-10

Edited: L 2018-7-10

I found another nuance (I swear this should be the last one!):

N004TR92I013nr01_TX_IEE1317.xls ==> N004TR92I013

An extra character in the mix. I've fiddled around with [A-Z,A-Z] but that didn't work.

Additionally, is there any in-depth documentation I could use to increase my knowledge of regexp? How did you guys learn to throw those combinations of characters together?

OCDER 2018-7-10

match = regexp(filename, '^[A-Z]+\d{3}[A-Z]+\d{2}[A-Z]\d{2}', 'match', 'once')

The "+" means "one or more", so [A-Z]+ matches one or more letters from A to Z.
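One thing worth checking, assuming the desired result N004TR92I013 quoted above is right: the last group of digits in that name has three digits, so a final \d{2} stops one digit early. The sketch below compares that with a variant ending in \d+; the variant and the variable names `twoDigits` and `anyDigits` are assumptions for illustration, not something confirmed in this thread.

```matlab
filename = 'N004TR92I013nr01_TX_IEE1317.xls';

% Final group fixed at exactly two digits.
twoDigits = regexp(filename, '^[A-Z]+\d{3}[A-Z]+\d{2}[A-Z]\d{2}', 'match', 'once');

% Variant (an assumption): final group is "one or more digits".
anyDigits = regexp(filename, '^[A-Z]+\d{3}[A-Z]+\d{2}[A-Z]\d+', 'match', 'once');

% Print both results, one per line.
fprintf('%s\n%s\n', twoDigits, anyDigits);
```

The first call should give N004TR92I01 and the second N004TR92I013. The \d+ variant still returns G001E94A13 and JG002C96E15 for the earlier names, since the wanted part there is followed by a letter rather than another digit.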

For help on any MATLAB function, type

 > help regexp
 > doc regexp

Or do a web search for it.
