Quick-Start Guide¶
pyvttbl has two primary containers for storing data. DataFrame objects hold tabulated data in a manner similar to what you would have in a spreadsheet. DataFrames are basically dictionary objects: the keys are the names of the columns and the values are NumPy arrays, or NumPy MaskedArrays if missing data is specified or encountered when reading datafiles.
DataFrames can be pivoted to create PyvtTbl objects. These are subclasses of NumPy MaskedArray. The use of MaskedArray helps make PyvtTbl objects robust to invalid or missing data.
Loading Data into a DataFrame¶
Before we can start building contingency tables we first need to load the data into a DataFrame object. Data can be read from plaintext datafiles using the DataFrame.read_tbl() method or inserted one row at a time using DataFrame.insert(). DataFrames can be attached to one another using DataFrame.attach(). The read_tbl() method expects comma-separated values (CSV) files by default. Other delimited file types should also work; just specify the delimiter. You can also specify whether your file has labels (default is True) and the number of lines to skip at the beginning of the file (default is 0).
>>> from __future__ import print_function
>>>
>>> from pyvttbl import DataFrame
>>> df = DataFrame()
>>> df.read_tbl('example.csv')
>>> print(df)
CASE TIME CONDITION X
==============================
1 day A 102
2 day B 70
3 night A 27
4 night B 38
5 day A 67
6 day B 127
7 night A 68
8 night B 49
9 day A 57
10 day B 71
11 night A 40
12 night B 84
>>>
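The exact keyword names for read_tbl() and the row format accepted by insert() are not spelled out above, so the sketch below is a hedged guess: it assumes read_tbl() takes delimiter, labels, and skip keywords and that insert() accepts a row as a list of (column, value) pairs. Check the API reference if your version differs; the file name in the comment is purely hypothetical.
>>> df2 = DataFrame()
>>> # a hypothetical tab-delimited file with no label row and two lines to skip:
>>> # df2.read_tbl('example.tsv', delimiter='\t', labels=False, skip=2)
>>> df2.insert([('CASE', 13), ('TIME', 'day'), ('CONDITION', 'A'), ('X', 88)])
>>> df2.insert([('CASE', 14), ('TIME', 'night'), ('CONDITION', 'B'), ('X', 41)])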
Manipulating Data in a DataFrame¶
The DataFrame class inherits from collections.OrderedDict. Its __str__() method is defined to output pretty-looking text tables. Because the DataFrame holds NumPy arrays, manipulating data is fairly straightforward.
>>> import numpy as np
>>> df['LOG10_X'] = np.log10(df['X'])
>>> print(df)
CASE TIME CONDITION X LOG10_X
========================================
1 day A 102 2.009
2 day B 70 1.845
3 night A 27 1.431
4 night B 38 1.580
5 day A 67 1.826
6 day B 127 2.104
7 night A 68 1.833
8 night B 49 1.690
9 day A 57 1.756
10 day B 71 1.851
11 night A 40 1.602
12 night B 84 1.924
>>>
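Because the columns are plain NumPy arrays, any vectorized expression works the same way; for instance, adding a mean-centered copy of X (the X_CENTERED label is just an illustration):
>>> df['X_CENTERED'] = df['X'] - df['X'].mean()   # elementwise, no loop needed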
Sorting a DataFrame¶
DataFrame has an in-place sort() method. You just have to specify the variables you want to sort by.
>>> df.sort(['TIME','CONDITION'])
>>> print(df)
CASE TIME CONDITION X LOG10_X
========================================
1 day A 102 2.009
5 day A 67 1.826
9 day A 57 1.756
2 day B 70 1.845
6 day B 127 2.104
10 day B 71 1.851
3 night A 27 1.431
7 night A 68 1.833
11 night A 40 1.602
4 night B 38 1.580
8 night B 49 1.690
12 night B 84 1.924
>>>
You can also reverse the order by using ‘DESC’ or ‘desc’ after the name of the variable.
>>> df.sort(['TIME DESC','CONDITION DESC'])
>>> print(df)
CASE TIME CONDITION X LOG10_X
========================================
4 night B 38 1.580
8 night B 49 1.690
12 night B 84 1.924
3 night A 27 1.431
7 night A 68 1.833
11 night A 40 1.602
2 day B 70 1.845
6 day B 127 2.104
10 day B 71 1.851
1 day A 102 2.009
5 day A 67 1.826
9 day A 57 1.756
Writing Data from a DataFrame¶
Tables can be exported using the write() method. If a filename is not supplied (as the fname keyword argument), one will be generated from the column labels.
>>> df.write()
>>> df.read_tbl('caseXtimeXconditionXxXlog10_x.csv')
>>> print(df)
CASE TIME CONDITION X LOG10_X
========================================
1 day A 102 2.009
2 day B 70 1.845
3 night A 27 1.431
4 night B 38 1.580
5 day A 67 1.826
6 day B 127 2.104
7 night A 68 1.833
8 night B 49 1.690
9 day A 57 1.756
10 day B 71 1.851
11 night A 40 1.602
12 night B 84 1.924
>>>
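To control the output name, pass the fname keyword (the file name below is only an example):
>>> df.write(fname='my_results.csv')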
Pivoting Data¶
Once data is in a DataFrame, pivoting is as simple as:
>>> pt = df.pivot('LOG10_X', ['TIME'], ['CONDITION'])
>>> print(pt)
avg(LOG10_X)
TIME CONDITION=A CONDITION=B
=================================
day 1.864 1.933
night 1.622 1.731
>>>
Calling pivot() returns a PyvtTbl object.
Multidimensional Pivoting¶
The example above isn’t all that impressive. The PyvtTbl class can also pivot higher-dimensional data.
>>> df.read_tbl('suppression~subjectXgroupXageXcycleXphase.csv')
>>> pt = df.pivot('SUPPRESSION',
rows=['CYCLE', 'PHASE'],
cols=['GROUP', 'AGE'])
>>> print(pt)
avg(SUPPRESSION)
CYCLE PHASE GROUP=AA, GROUP=AA, GROUP=AB, GROUP=AB, GROUP=LAB, GROUP=LAB,
AGE=old AGE=young AGE=old AGE=young AGE=old AGE=young
=======================================================================================
1 I 17.750 8.675 12.625 5.525 21.625 7.825
1 II 20.875 8.300 22.750 8.675 36.250 13.750
2 I 22.375 10.225 23.500 8.825 21.375 9.900
2 II 28.125 10.250 41.125 13.100 46.875 14.375
3 I 23.125 10.500 20 9.125 23.750 9.500
3 II 20.750 9.525 46.125 14.475 50.375 15.575
4 I 20.250 9.925 15.625 7.750 26.375 9.650
4 II 24.250 11.100 51.750 12.850 46.500 14.425
>>>
Pivot Aggregate Functions¶
If no aggregate keyword is supplied, pivoting results in averages of the underlying data. A variety of other aggregators can also be applied:
Aggregate            Description
========================================================================
abs_mean(X)          mean of the absolute values of X
avg(X)               average value of all non-NULL X within a group
arbitrary(X)         an arbitrary element of X
count(X)             count of the number of times that X is not NULL in a group
ci(X)                95% confidence interval of X
datarange(X)         range of X
geometric_mean(X)    geometric mean of X
group_concat(X)      concatenates the values of X as elements in a list
hasinf(X)            True if X contains any inf values
hasnan(X)            True if X contains any nan values
kurt(X)              sample kurtosis estimate of X
kurtp(X)             population kurtosis estimate of X
median(X)            median of X
mode(X)              mode of X
prod(X)              product of the elements of X
rms(X)               root mean square of X
sem(X)               standard error of the mean of X
skew(X)              sample skewness estimate of X (N-1)
skewp(X)             population skewness estimate of X (N)
stdev(X)             standard deviation estimate of the samples in X (N-1)
stdevp(X)            standard deviation of the population X (N)
tolist(X)            puts the values of X in a list (as a 3-dimensional PyvtTbl)
var(X)               variance estimate of the samples in X (N-1)
varp(X)              variance of the population X (N)
The pyvttbl module takes advantage of the aggregate functions in pystaggrelite3. You can also bind your own using DataFrame.bind_aggregate().
>>> pt = df.pivot('SUPPRESSION',
rows=['CYCLE', 'PHASE'],
cols=['GROUP', 'AGE'],
aggregate='count')
>>> print(pt)
count(SUPPRESSION)
CYCLE PHASE GROUP=AA, GROUP=AA, GROUP=AB, GROUP=AB, GROUP=LAB, GROUP=LAB,
AGE=old AGE=young AGE=old AGE=young AGE=old AGE=young
=======================================================================================
1 I 8 8 8 8 8 8
1 II 8 8 8 8 8 8
2 I 8 8 8 8 8 8
2 II 8 8 8 8 8 8
3 I 8 8 8 8 8 8
3 II 8 8 8 8 8 8
4 I 8 8 8 8 8 8
4 II 8 8 8 8 8 8
>>>
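Custom aggregators can be bound as well. The exact bind_aggregate() signature is not shown in this guide; the sketch below assumes it mirrors sqlite3's create_aggregate() and takes a name, an argument count, and a class with step() and finalize() methods. Treat the call as an assumption rather than the documented API.
>>> class rng(object):
        # hypothetical aggregator: range of the values in a group
        def __init__(self):
            self.vals = []
        def step(self, value):
            # called once for each value in the group
            self.vals.append(value)
        def finalize(self):
            # returns the aggregated result for the group
            return max(self.vals) - min(self.vals)
>>> df.bind_aggregate('rng', 1, rng)   # assumed signature: (name, arity, class)
>>> pt = df.pivot('SUPPRESSION',
          rows=['CYCLE', 'PHASE'],
          cols=['GROUP', 'AGE'],
          aggregate='rng')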
Example using the ‘tolist’ aggregator¶
>>> pt = df.pivot('SUPPRESSION',
rows=['CYCLE', 'PHASE','GROUP', 'AGE'],
aggregate='tolist')
>>> print(pt)
tolist(SUPPRESSION)
CYCLE PHASE GROUP AGE Value
================================================================================
1 I AA old [1.0, 37.0, 18.0, 1.0, 44.0, 15.0, 0.0, 26.0]
1 I AA young [6.2, 16.4, 7.6, 1.2, 13.8, 13.0, 1.0, 10.2]
1 I AB old [1.0, 21.0, 15.0, 30.0, 11.0, 16.0, 7.0, 0.0]
1 I AB young [3.2, 9.2, 3.0, 9.0, 6.2, 11.2, 2.4, 0.0]
1 I LAB old [33.0, 4.0, 32.0, 17.0, 44.0, 12.0, 18.0, 13.0]
1 I LAB young [15.6, 8.8, 7.4, 4.4, 8.8, 5.4, 4.6, 7.6]
1 II AA old [6.0, 59.0, 43.0, 2.0, 25.0, 14.0, 3.0, 15.0]
1 II AA young [4.2, 21.8, 15.6, 0.4, 6.0, 12.8, 0.6, 5.0]
1 II AB old [28.0, 21.0, 17.0, 34.0, 23.0, 11.0, 26.0, 22.0]
1 II AB young [5.6, 14.2, 5.4, 15.8, 5.6, 4.2, 5.2, 13.4]
1 II LAB old [43.0, 35.0, 39.0, 34.0, 52.0, 16.0, 42.0, 29.0]
1 II LAB young [14.6, 15.0, 13.8, 14.8, 20.4, 3.2, 16.4, 11.8]
2 I AA old [16.0, 28.0, 38.0, 9.0, 28.0, 22.0, 7.0, 31.0]
2 I AA young [11.2, 10.6, 14.6, 7.8, 10.6, 5.4, 7.4, 14.2]
2 I AB old [22.0, 16.0, 13.0, 55.0, 12.0, 18.0, 29.0, 23.0]
2 I AB young [12.4, 9.2, 2.6, 12.0, 11.4, 3.6, 14.8, 4.6]
2 I LAB old [40.0, 9.0, 38.0, 21.0, 37.0, 9.0, 3.0, 14.0]
2 I LAB young [12.0, 7.8, 16.6, 6.2, 9.4, 9.8, 8.6, 8.8]
2 II AA old [8.0, 36.0, 50.0, 8.0, 42.0, 32.0, 17.0, 32.0]
2 II AA young [7.6, 10.2, 19.0, 8.6, 9.4, 8.4, 11.4, 7.4]
2 II AB old [48.0, 40.0, 35.0, 54.0, 33.0, 34.0, 40.0, 45.0]
2 II AB young [14.6, 11.0, 15.0, 13.8, 11.6, 11.8, 13.0, 14.0]
2 II LAB old [52.0, 42.0, 47.0, 41.0, 48.0, 39.0, 62.0, 44.0]
2 II LAB young [17.4, 17.4, 14.4, 8.2, 14.6, 17.8, 12.4, 12.8]
3 I AA old [9.0, 34.0, 39.0, 6.0, 47.0, 16.0, 6.0, 28.0]
3 I AA young [6.8, 13.8, 13.8, 6.2, 19.4, 12.2, 1.2, 10.6]
3 I AB old [22.0, 15.0, 22.0, 37.0, 10.0, 11.0, 25.0, 18.0]
3 I AB young [4.4, 9.0, 9.4, 14.4, 3.0, 12.2, 12.0, 8.6]
3 I LAB old [39.0, 4.0, 24.0, 27.0, 33.0, 9.0, 45.0, 9.0]
3 I LAB young [12.8, 7.8, 8.8, 8.4, 9.6, 6.8, 13.0, 8.8]
3 II AA old [14.0, 32.0, 15.0, 5.0, 46.0, 23.0, 9.0, 22.0]
3 II AA young [9.8, 10.4, 6.0, 2.0, 17.2, 6.6, 11.8, 12.4]
3 II AB old [50.0, 39.0, 45.0, 57.0, 50.0, 40.0, 50.0, 38.0]
3 II AB young [10.0, 8.8, 14.0, 21.4, 14.0, 18.0, 16.0, 13.6]
3 II LAB old [52.0, 46.0, 44.0, 50.0, 53.0, 59.0, 49.0, 50.0]
3 II LAB young [20.4, 12.2, 17.8, 10.0, 16.6, 19.8, 10.8, 17.0]
4 I AA old [11.0, 26.0, 29.0, 5.0, 33.0, 32.0, 10.0, 16.0]
4 I AA young [3.2, 14.2, 11.8, 10.0, 8.6, 13.4, 5.0, 13.2]
4 I AB old [14.0, 11.0, 1.0, 57.0, 8.0, 5.0, 14.0, 15.0]
4 I AB young [6.8, 6.2, 5.2, 11.4, 4.6, 6.0, 9.8, 12.0]
4 I LAB old [38.0, 23.0, 16.0, 13.0, 33.0, 13.0, 60.0, 15.0]
4 I LAB young [7.6, 7.6, 10.2, 9.6, 7.6, 6.6, 21.0, 7.0]
4 II AA old [33.0, 37.0, 18.0, 15.0, 35.0, 26.0, 15.0, 15.0]
4 II AA young [10.6, 16.4, 9.6, 9.0, 16.0, 8.2, 8.0, 11.0]
4 II AB old [48.0, 56.0, 43.0, 68.0, 53.0, 40.0, 56.0, 50.0]
4 II AB young [15.6, 14.2, 10.6, 15.6, 11.6, 8.0, 16.2, 11.0]
4 II LAB old [48.0, 51.0, 40.0, 40.0, 43.0, 45.0, 57.0, 48.0]
4 II LAB young [11.6, 16.2, 8.0, 17.0, 12.6, 18.0, 16.4, 15.6]
>>> pt[0,0,0]
1.0
>>> pt[-1,-1,-1]
15.6
>>>
Note
NumPy ndarray objects and their subclasses need to have an equal number of elements at their greatest depth. If more elements meet the contingency criteria in one cell than in another, the shorter cells are padded with masked values out to the length of the cell with the greatest number of elements.
For example:
[[1, 2], [3, 4, 5]]
would become:
[[1, 2, --], [3, 4, 5]]
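The padding itself is ordinary NumPy masking; a minimal sketch of what a padded pair of cells looks like:
>>> import numpy as np
>>> a = np.ma.masked_all((2, 3))   # two cells padded out to the longest length (3)
>>> a[0, :2] = [1, 2]
>>> a[1, :] = [3, 4, 5]
>>> print(a)
[[1.0 2.0 --]
 [3.0 4.0 5.0]]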
Manipulating PyvtTbl objects¶
Mathematical Operations¶
PyvtTbl objects can be added, subtracted, or multiplied by constants or by other PyvtTbl objects of equivalent shape.
>>> df=DataFrame()
>>> df.read_tbl('data/error~subjectXtimeofdayXcourseXmodel_MISSING.csv')
>>> pt = df.pivot('ERROR', ['TIMEOFDAY'], ['COURSE'])
>>> print(pt)
avg(ERROR)
TIMEOFDAY COURSE=C1 COURSE=C2 COURSE=C3 Total
=====================================================
T1 7.167 6.500 4 5.619
T2 3.222 2.889 1.556 2.556
=====================================================
Total 4.800 4.333 2.778 3.896
>>> pt2=pt+5
>>> print(pt2)
avg(ERROR)
TIMEOFDAY COURSE=C1 COURSE=C2 COURSE=C3 Total
======================================================
T1 12.167 11.500 9 10.619
T2 8.222 7.889 6.556 7.556
======================================================
Total 9.800 9.333 7.778 8.896
>>> sums = df.pivot('ERROR', ['TIMEOFDAY'], ['COURSE'],
aggregate='sum')
>>> counts = df.pivot('ERROR', ['TIMEOFDAY'], ['COURSE'],
aggregate='count')
>>> print(sums/counts.astype(np.float64))
N/A(ERROR)
TIMEOFDAY COURSE=C1 COURSE=C2 COURSE=C3 Total
=====================================================
T1 7.167 6.500 4 5.619
T2 3.222 2.889 1.556 2.556
=====================================================
Total 4.800 4.333 2.778 3.896
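Arithmetic between two PyvtTbl objects of equivalent shape is also element-wise; as a quick sanity check (a sketch, output not shown verbatim):
>>> diff = pt2 - pt                # PyvtTbl minus PyvtTbl, element-wise
>>> print(diff.flatten())          # roughly [5.0 5.0 5.0 5.0 5.0 5.0]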
The operations are applied to the row and column totals where possible. Operations like np.sum, however, behave as expected; the displayed totals are not double counted.
>>> import numpy as np
>>> df=DataFrame()
>>> df.read_tbl('data/error~subjectXtimeofdayXcourseXmodel_MISSING.csv')
>>> counts = df.pivot('ERROR', ['TIMEOFDAY'], ['COURSE'],
aggregate='count')
>>> print(counts)
count(ERROR)
TIMEOFDAY COURSE=C1 COURSE=C2 COURSE=C3 Total
=====================================================
T1 6 6 9 21
T2 9 9 9 27
=====================================================
Total 15 15 18 48
>>> print(np.sum(counts))
48
PyvtTbl.transpose()¶
transpose() returns a transposed copy of the table.
>>> df=DataFrame()
>>> df.read_tbl('data/error~subjectXtimeofdayXcourseXmodel_MISSING.csv')
>>> pt = df.pivot('ERROR', ['TIMEOFDAY'], ['COURSE'],
aggregate='count')
>>> print(pt)
count(ERROR)
TIMEOFDAY COURSE=C1 COURSE=C2 COURSE=C3 Total
=====================================================
T1 6 6 9 21
T2 9 9 9 27
=====================================================
Total 15 15 18 48
>>> print(pt.transpose())
count(ERROR)
COURSE TIMEOFDAY=T1 TIMEOFDAY=T2 Total
============================================
C1 6 9 15
C2 6 9 15
C3 9 9 18
============================================
Total 21 27 48
>>>
PyvtTbl.flatten()¶
flatten() returns a flattened copy of the table as a MaskedArray.
>>> # continued from above
>>> print(pt.flatten())
[6 9 6 9 9 9]
>>> print(type(pt.flatten()))
<class 'numpy.ma.core.MaskedArray'>
Iterating over PyvtTbl objects¶
The default iteration method iterates over the first index of the PyvtTbl:
for L in pt  <==>  for i in xrange(pt.shape[0]): L = pt[i]
PyvtTbl.__iter__()¶
>>> from __future__ import print_function
>>> df=DataFrame()
>>> df.read_tbl('data/error~subjectXtimeofdayXcourseXmodel_MISSING.csv')
>>> pt = df.pivot('ERROR', ['TIMEOFDAY'],['COURSE'])
>>> for L in pt:
print(L)
print()
avg(ERROR)
TIMEOFDAY COURSE=C1 COURSE=C2 COURSE=C3
=============================================
T1 7.167 6.500 4
avg(ERROR)
TIMEOFDAY COURSE=C1 COURSE=C2 COURSE=C3
=============================================
T2 3.222 2.889 1.556
>>>
PyvtTbl.flat¶
flat is technically a property and not a method, so it is accessed without the “()”.
>>> from __future__ import print_function
>>> df=DataFrame()
>>> df.read_tbl('data/error~subjectXtimeofdayXcourseXmodel_MISSING.csv')
>>> pt = df.pivot('ERROR', ['TIMEOFDAY'],['COURSE'])
>>> for L in pt.flat:
print(L)
7.16666666667
6.5
4.0
3.22222222222
2.88888888889
1.55555555556
>>>
PyvtTbl.ndenumerate()¶
ndenumerate() returns an iterator yielding pairs of array coordinates and values.
>>> from __future__ import print_function
>>> df=DataFrame()
>>> df.read_tbl('data/error~subjectXtimeofdayXcourseXmodel_MISSING.csv')
>>> pt = df.pivot('ERROR', ['TIMEOFDAY'],['COURSE'])
>>> for (rind,cind),L in pt.ndenumerate():
print((rind,cind), L)
(0, 0) 7.16666666667
(0, 1) 6.5
(0, 2) 4.0
(1, 0) 3.22222222222
(1, 1) 2.88888888889
(1, 2) 1.55555555556
>>>
These indices can be used to look up the row and column labels:
>>> from __future__ import print_function
>>> df=DataFrame()
>>> df.read_tbl('data/error~subjectXtimeofdayXcourseXmodel_MISSING.csv')
>>> pt = df.pivot('ERROR', ['TIMEOFDAY','MODEL'],['COURSE'])
>>> print(pt)
avg(ERROR)
TIMEOFDAY MODEL COURSE=C1 COURSE=C2 COURSE=C3 Total
=============================================================
T1 M1 9 8.667 4.667 7.250
T1 M2 7.500 6 5 6
T1 M3 5 3.500 2.333 3.429
T2 M1 4.333 3.667 1.667 3.222
T2 M2 2.667 2.667 1.667 2.333
T2 M3 2.667 2.333 1.333 2.111
=============================================================
Total 4.800 4.333 2.778 3.896
>>> for (rind,cind),L in pt.ndenumerate():
rlabel = ', '.join('%s=%s'%(k,v) for k,v in pt.rnames[rind])
clabel = ', '.join('%s=%s'%(k,v) for k,v in pt.cnames[cind])
print(rlabel, '\t', clabel, '\t', L)
TIMEOFDAY=T1, MODEL=M1 COURSE=C1 9.0
TIMEOFDAY=T1, MODEL=M1 COURSE=C2 8.66666666667
TIMEOFDAY=T1, MODEL=M1 COURSE=C3 4.66666666667
TIMEOFDAY=T1, MODEL=M2 COURSE=C1 7.5
TIMEOFDAY=T1, MODEL=M2 COURSE=C2 6.0
TIMEOFDAY=T1, MODEL=M2 COURSE=C3 5.0
TIMEOFDAY=T1, MODEL=M3 COURSE=C1 5.0
TIMEOFDAY=T1, MODEL=M3 COURSE=C2 3.5
TIMEOFDAY=T1, MODEL=M3 COURSE=C3 2.33333333333
TIMEOFDAY=T2, MODEL=M1 COURSE=C1 4.33333333333
TIMEOFDAY=T2, MODEL=M1 COURSE=C2 3.66666666667
TIMEOFDAY=T2, MODEL=M1 COURSE=C3 1.66666666667
TIMEOFDAY=T2, MODEL=M2 COURSE=C1 2.66666666667
TIMEOFDAY=T2, MODEL=M2 COURSE=C2 2.66666666667
TIMEOFDAY=T2, MODEL=M2 COURSE=C3 1.66666666667
TIMEOFDAY=T2, MODEL=M3 COURSE=C1 2.66666666667
TIMEOFDAY=T2, MODEL=M3 COURSE=C2 2.33333333333
TIMEOFDAY=T2, MODEL=M3 COURSE=C3 1.33333333333
>>>
Note
Unlike numpy.ndenumerate(), PyvtTbl.ndenumerate() will only return the first two indices regardless of the number of dimensions in the table.
>>> pt = df.pivot('ERROR', ['TIMEOFDAY','MODEL'],['COURSE'],
aggregate='tolist')
>>> for (rind,cind),L in pt.ndenumerate():
rlabel = ', '.join('%s=%s'%(k,v) for k,v in pt.rnames[rind])
clabel = ', '.join('%s=%s'%(k,v) for k,v in pt.cnames[cind])
print(rlabel, '\t', clabel, '\t', L.flatten())
TIMEOFDAY=T1 COURSE=C1 [10.0 8.0 6.0 8.0 7.0 4.0 -- -- --]
TIMEOFDAY=T1 COURSE=C2 [9.0 10.0 6.0 4.0 7.0 3.0 -- -- --]
TIMEOFDAY=T1 COURSE=C3 [7.0 6.0 3.0 4.0 5.0 2.0 3.0 4.0 2.0]
TIMEOFDAY=T2 COURSE=C1 [5.0 4.0 3.0 4.0 3.0 3.0 4.0 1.0 2.0]
TIMEOFDAY=T2 COURSE=C2 [4.0 3.0 3.0 4.0 2.0 2.0 3.0 3.0 2.0]
TIMEOFDAY=T2 COURSE=C3 [2.0 2.0 1.0 2.0 3.0 2.0 1.0 0.0 1.0]
>>>
Converting a PyvtTbl to a DataFrame¶
PyvtTbl.to_dataframe() will return the pivoted data as a DataFrame object.
>>> import numpy as np
>>> from pyvttbl import DataFrame
>>> df = DataFrame()
>>> df.read_tbl('example.csv')
>>> df['LOG10_X'] = np.log10(df['X'])
>>> pt = df.pivot('LOG10_X', ['TIME'], ['CONDITION'])
>>> df2 = pt.to_dataframe()
>>> print(df2)
TIME CONDITION=A CONDITION=B
=================================
day 1.864 1.933
night 1.622 1.731
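Because to_dataframe() returns an ordinary DataFrame, the pivoted values can be sorted, manipulated, or written back out like any other table (the file name below is only an example):
>>> df2.write(fname='log10_x~timeXcondition.csv')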