Matching groups in a Python regex lookahead -


i have ~raw download of text data wordpress blog, structured follows:

post_id_1 title_1 date_1  text first post ..  post_id_2 title_2 date_2  text second post .. 

i wrote regex capture post_id, title, , date. goal create python dictionary structured as:

posts = {'date_1': {'post_id': post_id_1,                     'title': title_1,                     'text': 'this text first post ..'                     }         } 

the regex capture headers (post_id, title, date) follows:

header_regex_raw = r"""(\d+)\s(.*(?=january|february|march|april|may|june|july|august|september|october|november|december))(january|february|march|april|may|june|july|august|september|october|november|december)(\s\d+\,\s\d{4}\b)""" 

my thought re.findall(header_regex_raw + (.*(?={})).format(header_regex_raw), unfortunately doesn't work planned.

how capture multiple groups in lookahead? what's better way create above dict?

i found clean function in python re module: re.split.

header_regex_raw = r"""(\d+)\s(.+?(?=january|february|march|april|may|june|july|august|september|october|november|december))((january|february|march|april|may|june|july|august|september|october|november|december)(\s\d+\,\s\d{4}\b))""" header_text_header = re.compile(header_regex_raw) ret = header_text_header.split(data.strip()) 

this want: captures header elements in groups, text follows in group, following header elements in groups, etc.


Comments

Popular posts from this blog

c# - Validate object ID from GET to POST -

node.js - Custom Model Validator SailsJS -

php - Find a regex to take part of Email -