Matching groups in a Python regex lookahead
I have a roughly raw download of the text data from a WordPress blog, structured as follows:
post_id_1 title_1 date_1 text first post .. post_id_2 title_2 date_2 text second post ..
I wrote a regex to capture post_id, title, and date. My goal is to create a Python dictionary structured as:
posts = {'date_1': {'post_id': post_id_1, 'title': title_1, 'text': 'this text first post ..' } }
The regex to capture the headers (post_id, title, date) is as follows:
header_regex_raw = r"""(\d+)\s(.*(?=january|february|march|april|may|june|july|august|september|october|november|december))(january|february|march|april|may|june|july|august|september|october|november|december)(\s\d+\,\s\d{4}\b)"""
My thought was to use re.findall(header_regex_raw + "(.*(?={}))".format(header_regex_raw), data), but unfortunately that doesn't work as planned.
How can I capture multiple groups in a lookahead? What's a better way to create the above dict?
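On the first question, groups placed inside a lookahead are captured like any other group; the lookahead just doesn't consume the text it matches. The following is a minimal sketch on hypothetical toy data (the `data` string and the simplified numeric "header" are assumptions for illustration, not the blog format above). It uses a lazy `.*?` so that each match stops at the nearest following header rather than the last one, and `|$` so that the final record still matches:

```python
import re

# Hypothetical toy data: "<id> <text> <id> <text> ..."
data = "1 alpha beta 2 gamma 3 delta"

# Group 3 lives inside the lookahead. It is captured, but nothing in the
# lookahead is consumed, so the next match can start at that header.
pattern = re.compile(r"(\d+)\s(.*?)(?=\s(\d+)\s|$)")

matches = pattern.findall(data)
for post_id, text, next_id in matches:
    # next_id is '' for the last record, where the lookahead matched `$`
    print(post_id, repr(text), repr(next_id))
```

With a greedy `.*` in place of `.*?`, the text group would run on to the last header in the input, which is one likely reason the `findall` attempt did not behave as planned.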
I found a clean function in Python's re module: re.split.
header_regex_raw = r"""(\d+)\s(.+?(?=january|february|march|april|may|june|july|august|september|october|november|december))((january|february|march|april|may|june|july|august|september|october|november|december)(\s\d+\,\s\d{4}\b))"""
header_text_header = re.compile(header_regex_raw)
ret = header_text_header.split(data.strip())
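This does what I want: because the pattern contains capturing groups, re.split returns them in the result list, so I get the header elements of one post as separate items, then the text that follows as the next item, then the header elements of the following post, and so on.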
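To get from the list returned by split to the dictionary described in the question, one option is to walk the list in fixed-size chunks. This is only a sketch: the sample `data` string, the `MONTHS` helper, and the loop variable names are assumptions made for illustration, and the pattern is the same regex as above, just assembled from pieces for readability. The pattern has five groups, so each post contributes six consecutive items (five groups plus the text), and the first item of the list is whatever precedes the first header.

```python
import re

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

header_regex_raw = (
    r"(\d+)\s"                        # group 1: post_id
    + r"(.+?(?=" + MONTHS + r"))"     # group 2: title, lazily up to a month name
    + r"((" + MONTHS + r")"           # group 3: full date, group 4: month
    + r"(\s\d+\,\s\d{4}\b))"          # group 5: day and year
)
header_text_header = re.compile(header_regex_raw)

# Hypothetical sample in the shape described in the question
data = ("101 my first post january 5, 2015 text of the first post "
        "102 another title march 12, 2016 text of the second post")

parts = header_text_header.split(data.strip())

posts = {}
fields = parts[1:]  # parts[0] is the (here empty) text before the first header
for i in range(0, len(fields), 6):
    post_id, title, date, _month, _day_year, text = fields[i:i + 6]
    posts[date] = {
        "post_id": post_id,
        "title": title.strip(),
        "text": text.strip(),
    }

print(posts)
```

Note that keying the dictionary by date, as in the question, means two posts published on the same date would overwrite each other; keying by post_id would avoid that.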