python - LXML - parse td content within tr tag
I want to parse each individual statistic from the Yahoo Finance tables for formatting purposes, because when I parse the entire table the formatting is terrible. With the code below, I have to repeat the 4 lines of contenta code, altered to retrieve the stats in each row of the table, as exemplified by the contentb variables. I refuse to believe this is the most efficient way to do it. Any suggestions?
from lxml import html

url = 'http://finance.yahoo.com/q/is?s=mmm+income+statement&annual'
tree = html.parse(url)

contenta = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[1]")[0].text_content().strip()
contenta1 = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[2]")[0].text_content().strip()
contenta2 = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[3]")[0].text_content().strip()
contenta3 = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[4]")[0].text_content().strip()

contentb = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[3]/td[1]")[0].text_content().strip()
contentb1 = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[3]/td[2]")[0].text_content().strip()
contentb2 = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[3]/td[3]")[0].text_content().strip()
contentb3 = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[3]/td[4]")[0].text_content().strip()
Use range to loop over the column positions and str.format to build the td index into the XPath:
for i in range(1, 5):
    contenta = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[{i}]".format(i=i))[0].text_content().strip()
    print(contenta)
Output:

Total Revenue
31,821,000
30,871,000
29,904,000
for i in range(1, 5):
    contentb = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[3]/td[{i}]".format(i=i))[0].text_content().strip()
    print(contentb)
Output:

Cost of Revenue
16,447,000
16,106,000
15,685,000
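The two loops above still hard-code the row number (tr[2], tr[3]). One possible way to avoid that is to select every row once and then query the cells relative to each row. The sketch below is only an illustration: SAMPLE is a hypothetical, trimmed stand-in for the page (the real markup may differ), built to match the nesting that the XPath in the question expects.

```python
from lxml import html

# Hypothetical stand-in for the Yahoo page: same nesting as the XPath in the
# question (outer table > tr > td > inner table), with the two rows shown above.
SAMPLE = """
<html><body>
<table class="yfnc_tabledata1">
  <tr><td>
    <table>
      <tr><th>(header row)</th></tr>
      <tr><td>Total Revenue</td><td>31,821,000</td><td>30,871,000</td><td>29,904,000</td></tr>
      <tr><td>Cost of Revenue</td><td>16,447,000</td><td>16,106,000</td><td>15,685,000</td></tr>
    </table>
  </td></tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

rows = []
# Select all rows of the inner table in one query ...
for tr in tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr"):
    # ... then read the cells relative to the current row.
    cells = [td.text_content().strip() for td in tr.xpath("./td")]
    if cells:  # skip rows without td cells, such as the header row
        rows.append(cells)

for cells in rows:
    print(cells)
```

Against the live page, tree would come from html.parse(url) as in the question; the row and cell XPath expressions stay the same as long as the markup matches.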
Edit: to keep the numbers as integers, strip the commas and collect the values in a dict:

In [22]: d = {}

In [23]: d.setdefault('revenue', [])
Out[23]: []

In [24]: for i in range(2, 5):
   ....:     contentb = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[3]/td[{i}]".format(i=i))[0].text_content().strip()
   ....:     d['revenue'].append(int(contentb.replace(',', '')))
   ....:

In [25]: d
Out[25]: {'revenue': [16447000, 16106000, 15685000]}
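The same idea can be extended to several rows at once by keying the dict on each row's label. This is a minimal sketch, assuming every row has already been collected as a list of cell strings with the label first. The helper names to_int and rows_to_dict are hypothetical, and cells that are not plain comma-separated digits are not handled.

```python
def to_int(text):
    """Turn a string such as '16,447,000' into the integer 16447000."""
    return int(text.replace(",", ""))


def rows_to_dict(rows):
    """Map each row's first cell (the label) to its remaining cells as ints."""
    return {label: [to_int(value) for value in values] for label, *values in rows}


# Cell text taken from the two outputs shown above.
rows = [
    ["Total Revenue", "31,821,000", "30,871,000", "29,904,000"],
    ["Cost of Revenue", "16,447,000", "16,106,000", "15,685,000"],
]

stats = rows_to_dict(rows)
print(stats["Cost of Revenue"])  # [16447000, 16106000, 15685000]
```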