Regex: How can I perform positive lookbehind for a specific sequence of characters?

EDIT: Changed 'Negative lookbehind' to 'Positive lookbehind'
Hi,
I am attempting to separate the first name from a list of names, using regex. The format of the names is as follows:
<last name>, <title>. <first name> <middle names> (<other name>)
Where <middle names> and (<other name>) are optional.
I'm new to regex, and currently finding it hard to intuit. It seems to me that I need a positive lookbehind to capture the word preceded by a '.' followed by a whitespace character in order to capture the first names, but it's not working how I'd like! See code below:
load titanic.mat
% Attempt #1 (Matches words preceded by '.' characters OR whitespace characters -
% I need it to match '.' followed by a whitespace... how???
name_first = regexp(train.Name, '(?<=[\.\s])([A-Z][a-z]+)', 'match')
% Attempt #2 (Captures unwanted '. ' before first names)
name_first2 = regexp(train.Name, '\.\s([A-Z][a-z]+)', 'match')
% Attempt #3 (Attempt to capture 3rd word, doesn't work)
name_first3 = regexp(train.Name, '(\w.*\w){3}', 'match')
Alternative solutions are great, but ideally I'd like to understand WHY my current code doesn't work (specifically attempt #1), and how I might be able to make it work using a positive lookbehind that looks behind for a specific sequence of characters (i.e. return a word preceded by 'abc').
Thanks in advance for your help.
  4 comments
Walter Roberson 2021-9-14
Edited: Walter Roberson 2021-9-14
% I need it to match '.' followed by a whitespace... how???
Using
name_first = regexp(train.Name, '(?<=\.\s)([A-Z][a-z]+)', 'match')
But consider making it \s+ instead of \s.
Also, are you sure you do not need to handle names with an apostrophe, like O'Rorke? Are you sure you do not need to handle names with dashes, like Fitz-Williams? Are you sure you do not need to handle surnames with spaces, such as van Horton? That last one, incidentally, is also an example of a name that starts with a lower-case letter.
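To make the difference between the two lookbehinds concrete, here is a minimal sketch. The sample name is made up for illustration (it is not taken from titanic.mat), and the 'tokens' variant at the end is only one possible alternative for attempt #2, not something required by the lookbehind approach.

```matlab
% Hypothetical sample name in the format described in the question
name = 'Smith, Mr. John Paul (Jack)';

% Character class [\.\s]: the ONE character before the word must be
% either a '.' or a whitespace character
m_class = regexp(name, '(?<=[\.\s])([A-Z][a-z]+)', 'match');

% Sequence \.\s: the TWO characters before the word must be a '.'
% followed by a whitespace character
m_seq = regexp(name, '(?<=\.\s)([A-Z][a-z]+)', 'match');

% Alternative without lookbehind: match '. ' as in attempt #2, but return
% only the parenthesised token instead of the whole match
tok = regexp(name, '\.\s+([A-Z][a-z]+)', 'tokens', 'once');
first_name = tok{1};

disp(m_class)     % every capitalised word that follows a space or a '.'
disp(m_seq)       % only the word that follows '. '
disp(first_name)
```

Square brackets define a set of single characters, so [\.\s] means "a period or a whitespace character", which is why attempt #1 also picks up the title and the middle name. Writing \.\s without brackets asks for the two characters in order.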
Adam Brann 2021-9-14
Thanks for your answer, exactly what I needed. I mistakenly thought the characters to be 'looked behind for' needed to be inside square brackets.
Excellent points regarding the 'unusual' names, I'll go away and have a think about how I might write a regexp to capture those cases. Many thanks for your help.
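One possible starting point for those cases is sketched below. It assumes that a first name is a run of letters, optionally joined by apostrophes or hyphens, and that it may start with a lower-case letter. The sample names and the variable names are hypothetical, and the pattern is only a rough sketch rather than a complete treatment of real-world names.

```matlab
% Hypothetical sample names covering apostrophes, hyphens and lower-case starts
names = {'O''Rorke, Mr. D''Arcy James'; ...
         'Fitz-Williams, Miss. Mary-Jane'; ...
         'van Horton, Dr. pieter (Piet)'};

% Letters, then zero or more groups of (apostrophe or hyphen + letters).
% The apostrophe is doubled because it sits inside a MATLAB char vector.
pat = '(?<=\.\s)[A-Za-z]+(?:[''-][A-Za-z]+)*';

% 'once' keeps only the first match per name, so a later '. ' in the
% same name (for example after an initial) is ignored
first_names = regexp(names, pat, 'match', 'once');

disp(first_names)
```

Because the lookbehind only inspects the text after the title, spaces or lower-case letters in the surname do not affect the result here.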


Answers (0)

Release

R2021a
