Is there a limit on the number of side outputs in Google Cloud Dataflow?
We have a Cloud Dataflow job that takes in a BigQuery table, transforms it, and then writes each record out to a different table depending on the month/year in the record's timestamp. When we run our job on a table with 12 months of data, there should be 12 output tables. The first month is the main output, and the other 11 months are side outputs.
We have found that the job fails when run on 10 or more months (9 side outputs).
Is this a limit in Cloud Dataflow, or is it a bug?
I noticed in the execution graph that when running with more than 8 side outputs, some of the outputs said "running" but didn't seem to be writing any records.
Here are some of our job IDs:
2015-06-14_23_58_06-14457541029573485807 (8 side outputs - passed)
2015-06-14_23_48_43-15277609445992188388 (9 side outputs - failed)
2015-06-14_23_11_46-10500077558949649888 (7 side outputs - passed)
2015-06-14_22_38_48-1428211312699949403 (3 side outputs - passed)
2015-06-14_21_44_27-16273252623089185131 (11 side outputs - failed)
This is the code that processes the data. There is no caching involved. (TressOutputManager just holds a cache of TupleTag<TableRow>.)
    public class TressDenormalizationDoFn extends DoFn<TableRow, TableRow> {

        @Inject
        @Named("tress.mappers")
        private Set<CPTMapper> mappers;

        @Inject
        private TressOutputManager tuples;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            TableRow row = c.element().clone();

            // Apply each mapper to the "event" field and store the result under the mapper's id
            for (CPTMapper mapper : mappers) {
                String mapped = mapper.map((String) row.get("event"));
                if (mapped != null) {
                    row.set(mapper.getId(), mapped);
                }
            }

            // Places the record in the correct month based on its timestamp
            String timestamp = (String) row.get("time_local");
            if (timestamp != null) {
                // e.g. "2015-06-14 ..." becomes "2015_06"
                timestamp = timestamp.substring(0, 7).replaceAll("-", "_");
                if (tuples.isMainOutput(timestamp)) {
                    c.output(row);
                } else {
                    c.sideOutput(tuples.getTuple(timestamp), row);
                }
            }
        }
    }
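To make the routing easier to follow, here is a minimal, dependency-free sketch of the role TressOutputManager plays in the code above. It is not our actual class: `MonthRouter`, its constructor, `sideOutputCount()`, and the `"side_"` tag names are hypothetical, and plain strings stand in for the `TupleTag<TableRow>` objects that the real job uses. Only the month-key derivation (`substring(0, 7)` plus replacing `-` with `_`) is taken from the DoFn itself.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for TressOutputManager: the first month key is
 * treated as the main output, every other month key gets its own side tag.
 * Strings are used in place of TupleTag<TableRow> so the sketch runs
 * without the Dataflow SDK.
 */
final class MonthRouter {
    private final String mainKey;
    private final Map<String, String> sideTags = new LinkedHashMap<>();

    MonthRouter(List<String> monthKeys) {
        if (monthKeys.isEmpty()) {
            throw new IllegalArgumentException("at least one month key is required");
        }
        mainKey = monthKeys.get(0);
        for (String key : monthKeys.subList(1, monthKeys.size())) {
            sideTags.put(key, "side_" + key); // assumed tag naming, for illustration only
        }
    }

    /** Same derivation as in the DoFn: "2015-06-14 ..." becomes "2015_06". */
    static String monthKey(String timestamp) {
        if (timestamp == null || timestamp.length() < 7) {
            return null;
        }
        return timestamp.substring(0, 7).replace('-', '_');
    }

    boolean isMainOutput(String key) {
        return mainKey.equals(key);
    }

    /** Returns the side tag for a month key, or null if the month is unknown. */
    String getTuple(String key) {
        return sideTags.get(key);
    }

    int sideOutputCount() {
        return sideTags.size();
    }
}
```

With 12 month keys this yields 1 main output and 11 side outputs, which matches the setup described above; with 10 month keys it yields the 9 side outputs at which the job starts failing.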