python - How to apply functions with multiple arguments on selected columns of a Pandas data frame
I have the following data frame:
    import pandas as pd

    data = {'gene': ['a', 'b', 'c', 'd', 'e'],
            'count': [61, 320, 34, 14, 33],
            'gene_length': [152, 86, 92, 170, 111]}
    df = pd.DataFrame(data)
    df = df[["gene", "count", "gene_length"]]

That looks like this:
    In [9]: df
    Out[9]:
      gene  count  gene_length
    0    a     61          152
    1    b    320           86
    2    c     34           92
    3    d     14          170
    4    e     33          111

What I want to do is apply this function:
    def calculate_rpkm(thec, then, thel):
        """
        thec == total reads mapped to a feature (gene/linc)
        thel == length of feature (gene/linc)
        then == total reads mapped
        """
        rpkm = float((10**9) * thec)/(then * thel)
        return rpkm

on the count and gene_length columns, with the constant n=12345, and name the new result 'rpkm'. Why does this fail?
    n = 12345
    df["rpkm"] = calculate_rpkm(df['count'], n, df['gene_length'])

What's the right way to do it? The first row should look like this:
    gene  count  gene_length       rpkm
       a     61          152  32508.366

Update: the error I got is this:
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-4-6270e1d19b89> in <module>()
    ----> 1 df["rpkm"] = calculate_rpkm(df['count'],n,df['gene_length'])

    <ipython-input-1-48e311ca02f3> in calculate_rpkm(thec, then, thel)
         13     then == total reads mapped
         14     """
    ---> 15     rpkm = float((10**9) * thec)/(then * thel)
         16     return rpkm

    /u21/coolme/.anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
         74             return converter(self.iloc[0])
         75         raise TypeError(
    ---> 76             "cannot convert the series to {0}".format(str(converter)))
         77     return wrapper
         78
If you don't cast to float in your method, it will work fine:
    In [9]:
    def calculate_rpkm(thec, then, thel):
        """
        thec == total reads mapped to a feature (gene/linc)
        thel == length of feature (gene/linc)
        then == total reads mapped
        """
        # No float() cast, so thec and thel can be whole Series
        rpkm = ((10**9) * thec)/(then * thel)
        return rpkm

    n = 12345
    df["rpkm"] = calculate_rpkm(df['count'], n, df['gene_length'])
    df

    Out[9]:
      gene  count  gene_length           rpkm
    0    a     61          152   32508.366908
    1    b    320           86  301411.926493
    2    c     34           92   29936.429112
    3    d     14          170    6670.955138
    4    e     33          111   24082.405613

The error message is telling you that you cannot cast a pandas Series to a float. Whilst you could call apply to call your method row-wise, you should look at rewriting your method so it can work on the entire Series. This will be vectorised and much faster than calling apply, which is essentially a for loop.
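To make the difference concrete, here is a minimal sketch that rebuilds the data frame from the question and compares the two approaches. The names `rpkm_scalar`, `rpkm_vectorised`, `cast_failed` and `row_wise` are made up for this illustration and are not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
n = 12345


def rpkm_scalar(thec, then, thel):
    # Version from the question: float() only accepts a single value.
    return float((10**9) * thec) / (then * thel)


def rpkm_vectorised(thec, then, thel):
    # No float() cast, so thec and thel may be whole Series.
    return ((10**9) * thec) / (then * thel)


# Passing whole columns to the scalar version raises the TypeError.
try:
    rpkm_scalar(df["count"], n, df["gene_length"])
    cast_failed = False
except TypeError:
    cast_failed = True

# Row-wise apply hands the scalar version one row at a time.
row_wise = df.apply(
    lambda row: rpkm_scalar(row["count"], n, row["gene_length"]), axis=1
)

# The vectorised version works on the columns directly.
df["rpkm"] = rpkm_vectorised(df["count"], n, df["gene_length"])
print(df)
```

Both routes should give the same numbers; the vectorised one simply avoids the Python-level loop over rows.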
Timings
    In [11]:
    def calculate_rpkm1(thec, then, thel):
        """
        thec == total reads mapped to a feature (gene/linc)
        thel == length of feature (gene/linc)
        then == total reads mapped
        """
        # Vectorised version: no float() cast
        rpkm = ((10**9) * thec)/(then * thel)
        return rpkm

    def calculate_rpkm(thec, then, thel):
        """
        thec == total reads mapped to a feature (gene/linc)
        thel == length of feature (gene/linc)
        then == total reads mapped
        """
        # Scalar version: float() only works on single values
        rpkm = float((10**9) * thec)/(then * thel)
        return rpkm

    n = 12345
    %timeit calculate_rpkm1(df['count'], n, df['gene_length'])
    %timeit df[(['count', 'gene_length'])].apply(lambda x: calculate_rpkm(x[0], n, x[1]), axis=1)

    1000 loops, best of 3: 238 µs per loop
    100 loops, best of 3: 1.5 ms per loop

You can see that the non-casting version is over 6x faster, and it will be more performant still on larger datasets.
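If you want to repeat the comparison outside IPython, something like the following should work with the standard `timeit` module. It is only a sketch: the random data, the row count and the helper names `vectorised` and `row_wise` are assumptions for illustration, and the rows are accessed by label rather than by position. Absolute timings will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rows = 2000  # hypothetical size, chosen only to keep the run short
big = pd.DataFrame({
    "count": rng.integers(1, 500, size=rows),
    "gene_length": rng.integers(50, 200, size=rows),
})
n = 12345


def vectorised():
    # Whole-column arithmetic, no Python-level loop.
    return (10**9 * big["count"]) / (n * big["gene_length"])


def row_wise():
    # apply with axis=1 calls the lambda once per row.
    return big.apply(
        lambda row: float(10**9 * row["count"]) / (n * row["gene_length"]),
        axis=1,
    )


t_vec = timeit.timeit(vectorised, number=5)
t_row = timeit.timeit(row_wise, number=5)
print("vectorised: %.4f s, apply: %.4f s (5 runs each)" % (t_vec, t_row))
```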
Update
The following code, used together with the non-casting version of your method, is semantically equivalent:
    df['rpkm'] = calculate_rpkm1(df['count'].astype(float), n, df['gene_length'])
    df

    Out[16]:
      gene  count  gene_length           rpkm
    0    a     61          152   32508.366908
    1    b    320           86  301411.926493
    2    c     34           92   29936.429112
    3    d     14          170    6670.955138
    4    e     33          111   24082.405613
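As a small check of that equivalence, the sketch below casts the counts with `Series.astype(float)`, which converts element by element where `float()` cannot, and compares the result with the uncast calculation. The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

counts = pd.Series([61, 320, 34, 14, 33], name="count")
lengths = pd.Series([152, 86, 92, 170, 111], name="gene_length")
n = 12345

# astype(float) casts every element; the dtype becomes float64.
as_float = counts.astype(float)

from_float = (10**9 * as_float) / (n * lengths)
from_int = (10**9 * counts) / (n * lengths)

print(as_float.dtype)
print(np.allclose(from_float, from_int))
```

For this data the cast does not change the result, because `/` on a pandas Series already performs true division.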