DataFrame Overview¶
Description¶
DataFrame
objects are containers for holding tabulated data. They
inherent collections.OrderedDict
and hold data in
numpy.array
objects.
Public Methods¶
Methods to get data into a DataFrame
, manipulate and manage data,
and write data. For the most part these remove the user from the fact they
are using sqlite3 to manipulate data.
- DataFrame.__init__(*args, **kwds)¶
initialize a
DataFrame
object.Subclass ofcollections
.OrderedDict
.Understands the typical initializationdict
signatures.Keys must be hashable.Values become numpy.arrays or numpy.ma.MaskedArrays.
- DataFrame.__setitem__(key, item, mask=None)¶
assign a column in the table
- args:
key: hashable object to associate with item
item: an iterable that is put in an np.array or np.ma.array
- kwds:
mask: mask value passed to np.ma.MaskedArray.__init__()
- returns:
None
df.__setitem__(key, item) <==> df[key] = itemThe assigned item must be iterable. To add a single row use the insert method. To another table to this one use the attach method.- example:
>>> ... >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>> import numpy as np >>> df['log10(age)'] = np.log10(df['age']) >>> print(df) first last age gender log10(age) =============================================== Roger Lew 28 male 1.447 Bosco Robinson 5 male 0.699 Megan Whittington 26 female 1.415 John Smith 51 male 1.708 Jane Doe 49 female 1.690 >>>
- DataFrame.__delitem__(key)¶
delete a column from the table
- args:
key: associated with the item to delete
- returns:
None
df.__delitem__(key) <==> del df[key]- example:
>>> ... >>> print(df) first last age gender log10(age) =============================================== Roger Lew 28 male 1.447 Bosco Robinson 5 male 0.699 Megan Whittington 26 female 1.415 John Smith 51 male 1.708 Jane Doe 49 female 1.690 >>> del df['log10(age)'] >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>>
- DataFrame.__str__()¶
returns human friendly string representation of object
- args:
None
- returns:
string with easy to read representation of table
df.__str__() <==> str(df)
- DataFrame.read_tbl(fname, skip=0, delimiter=',', labels=True)¶
loads tabulated data from a plain text file
- args:
fname: path and name of datafile
- kwds:
skip: number of lines to skip before looking for column labels. (default = 0)
delimiter: string to seperate values (default = “’”)
labels: bool specifiying whether first row (after skip) contains labels. (default = True)
- returns:
None
Checks and renames duplicate column labels as well as checkingfor missing cells. readTbl will warn and skip over missing lines.
- DataFrame.attach(other)¶
attaches a second
DataFrame
to self- args:
other: a
DataFrame
object whose key set matches self- return:
None
- DataFrame.insert(row)¶
insert a row into the table
- args:
row: should be mappable. e.g. a dict or a list with key/value pairs.
- returns:
None
- example:
>>> from pyvttbl import DataFrame >>> from collections import namedtuple >>> Person = namedtuple('Person',['first','last','age','gender']) >>> df =DataFrame() >>> df.insert(Person('Roger', 'Lew', 28, 'male')._asdict()) >>> df.insert(Person('Bosco', 'Robinson', 5, 'male')._asdict()) >>> df.insert(Person('Megan', 'Whittington', 26, 'female')._asdict()) >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female >>>
- DataFrame.sort(order=None)¶
sort the table in-place
- kwds:
order: is a list of factors to sort by to reverse order append ” desc” to the factor
- returns:
None
- example:
>>> from pyvttbl import DataFrame >>> from collections import namedtuple >>> Person = namedtuple('Person',['first','last','age','gender']) >>> df =DataFrame() >>> df.insert(Person('Roger', 'Lew', 28, 'male')._asdict()) >>> df.insert(Person('Bosco', 'Robinson', 5, 'male')._asdict()) >>> df.insert(Person('Megan', 'Whittington', 26, 'female')._asdict()) >>> df.insert(Person('John', 'Smith', 51, 'male')._asdict()) >>> df.insert(Person('Jane', 'Doe', 49, 'female')._asdict()) >>> df.sort(['gender', 'age']) >>> print(df) first last age gender ================================== Megan Whittington 26 female Jane Doe 49 female Bosco Robinson 5 male Roger Lew 28 male John Smith 51 male >>>
- DataFrame.select_col(key, where=None)¶
determines rows in table that satisfy the conditions given by where and returns the values of key in the remaining rows
- args:
key: column label of data to return
- kwds:
where: constraints to apply to table before returning data
- returns:
a list
- example:
>>> ... >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>> df.select_col('age', where='gender == "male"') [28, 5, 51] >>>
- DataFrame.row_iter()¶
iterate over the rows in table
- args:
None
- returns:
iterator that yields OrderedDict objects with (key,value) pairs cooresponding to the data in each row
- example:
>>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 male John Smith 51 female Jane Doe 49 female >>> for case in df.row_iter(): print(case) OrderedDict([('first', 'Roger'), ('last', 'Lew'), ('age', 28), ('gender', 'male')]) OrderedDict([('first', 'Bosco'), ('last', 'Robinson'), ('age', 5), ('gender', 'male')]) OrderedDict([('first', 'Megan'), ('last', 'Whittington'), ('age', 26), ('gender', 'male')]) OrderedDict([('first', 'John'), ('last', 'Smith'), ('age', 51), ('gender', 'female')]) OrderedDict([('first', 'Jane'), ('last', 'Doe'), ('age', 49), ('gender', 'female')]) >>>
- DataFrame.pivot(val, rows=None, cols=None, aggregate='avg', where=None, attach_rlabels=False, method='valid')¶
produces a contingency table according to the arguments and keywords provided.
- args:
val: the colname to place as the data in the table
- kwds:
- rows: list of colnames whos combinations will become rows
in the table if left blank their will be one row
- cols: list of colnames whos combinations will become cols
in the table if left blank their will be one col
- aggregate: function applied across data going into each cell
of the table <http://www.sqlite.org/lang_aggfunc.html>_
where: list of tuples or list of strings for filtering data
- method:
‘valid’: only returns rows or columns with valid entries.
- ‘full’: return full factorial combinations of the
conditions specified by rows and cols
- returns:
PyvtTbl
object
- DataFrame.where(where)¶
Applies the where filter to a copy of the DataFrame, and returns the new DataFrame. The associated DataFrame is not copied.
- args:
where: criterion to apply to new table
- returns:
a new
DataFrame
- example:
>>> ... >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>> print(df.where('age > 20 and age < 45')) first last age gender ================================== Roger Lew 28 male Megan Whittington 26 female >>>
- DataFrame.where_update(where)¶
Applies the where filter in-place.
- args:
where: criterion to apply to table
- returns:
None
- DataFrame.summary(where=None)¶
prints the descriptive information for each column in DataFrame
- kwds:
where: criterion to apply to table before running analysis
- returns:
None
- DataFrame.validate(criteria, verbose=False, report=False)¶
validate the data in the table.
- args:
criteria: a dict whose keys should coorespond to columns in the table. The values should be functions which take a single parameter and return a boolean.
- kwds:
- verbose:
True: provide real-time feedback
False: don’t provide feedback (default)
- report:
True: print a report upon completion
False: don’t print report (default)
- returns:
True: the criteria was satisfied
False: the critera was not satisfied
- example:
>>> ... >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>> def isint(x): try : return int(x)-float(x)==0 except: return False >>> df.validate({'age' : lambda x: isint(x), 'gender' : lambda x: x in ['male', 'female']}, verbose=True, report=True) Validating gender: ..... Validating age: ..... Report: Values tested: 10 Values passed: 10 Values failed: 0 ***Validation PASSED*** True >>>
- DataFrame.types()¶
- returns a list of the sqlite3 datatypes of the columns
- args:
None
- returns:
an ordered list of sqlite3 types.
order matches self.keys()
- DataFrame.shape()¶
returns the size of the data in the table as a tuple
- args:
None
- returns:
tuple (number of columns, number of rows)
- DataFrame.bind_aggregate(name, arity, func)¶
binds a sqlite3 aggregator to
DataFrame
- args:
name: string to be associated with the aggregator
arity: the number of inputs required by the aggregator
func: the aggregator class
- returns:
None
DataFrame
.aggregates is a list of the available aggregators.For information on rolling your own aggregators see: http://docs.python.org/library/sqlite3.html
pyvttbl.plotting
Wrappers¶
Methods to visualize data.
- DataFrame.histogram_plot(val, **kwargs)¶
Makes a histogram plot
- args:
key: column label of dependent variable
- kwds:
where: criterion to apply to table before running analysis
bins: number of bins (default = 10)
range: list of length 2 defining min and max bin edges
- DataFrame.scatter_plot(aname, bname, **kwargs)¶
Creates a scatter plot with the specified parameters
- args:
aname: variable on x-axis
bname: variable on y-axis
- kwds:
- alpha:
amount of transparency applied
- trend :
None: no model fitting
‘linear’: f(x) = a + b*x
‘exponential’: f(x) = a * x**b
‘logarithmic’: f(x) = a * log(x) + b
‘polynomial’: f(x) = a * x**2 + b*x + c
‘power’: f(x) = a * x**b
- DataFrame.box_plot(val, factors=None, **kwargs)¶
Makes a box plot
- args:
- df:
a pyvttbl.DataFrame object
- val:
the label of the dependent variable
- kwds:
- factors:
a list of factors to include in boxplot
- where:
a string, list of strings, or list of tuples applied to the DataFrame before plotting
- fname:
output file name
- quality:
{‘low’ | ‘medium’ | ‘high’} specifies image file dpi
- DataFrame.interaction_plot(val, xaxis, **kwargs)¶
makes an interaction plot
- args:
- df:
a pyvttbl.DataFrame object
- val:
the label of the dependent variable
- xaxis:
the label of the variable to place on the xaxis of each subplot
- kwds:
- seplines:
label specifying seperate lines in each subplot
- sepxplots:
label specifying seperate horizontal subplots
- sepyplots:
label specifying separate vertical subplots
- xmin:
(‘AUTO’ by default) minimum xaxis value across subplots
- xmax:
(‘AUTO’ by default) maximum xaxis value across subplots
- ymin:
(‘AUTO’ by default) minimum yaxis value across subplots
- ymax:
(‘AUTO’ by default) maximum yaxis value across subplots
- where:
a string, list of strings, or list of tuples applied to the DataFrame before plotting
- fname:
output file name
- quality:
{‘low’ | ‘medium’ | ‘high’} specifies image file dpi
- yerr:
{float, ‘ci’, ‘stdev’, ‘sem’} designates errorbars across datapoints in all subplots
- DataFrame.scatter_matrix(variables, **kwargs)¶
Plots a matrix of scatterplots
- args:
- variables:
column labels to include in scatter matrix
- kwds:
- alpha:
amount of transparency applied
- grid:
setting this to True will show the grid
- diagonal:
‘kde’: Kernel Density Estimation
‘hist’: 20 bin Histogram
None: just labels
- trend :
None: no model fitting
‘linear’: f(x) = a + b*x (default)
‘exponential’: f(x) = a * x**b
‘logarithmic’: f(x) = a * log(x) + b
‘polynomial’: f(x) = a * x**2 + b*x + c
‘power’: f(x) = a * x**b
- alternate_labels: Specifies whether the labels and ticks should
alternate. Default is True. When False tick labels will be on the left and botttom, and variable labels will be on the top and right.
pyvttbl.stats
Wrappers¶
Methods to conduct descriptive and inferential statistical analyses.
- DataFrame.descriptives(key, where=None)¶
Conducts a descriptive statistical analysis of the data in self[key].
- args:
key: column label
- kwds:
where: criterion to apply to table before running analysis
- returns:
a
pyvttbl.stats
.Descriptives
object
- DataFrame.histogram(key, where=None, bins=10, range=None, density=False, cumulative=False)¶
Conducts a histogram analysis of the data in self[key].
- args:
key: column label of dependent variable
- kwds:
where: criterion to apply to table before running analysis
bins: number of bins (default = 10)
range: list of length 2 defining min and max bin edges
- returns:
a
pyvttbl.stats
.Descriptives
object
- DataFrame.marginals(key, factors, where=None)¶
- DataFrame.anova1way(val, factor, posthoc='tukey', where=None)¶
Conducts a one-way analysis of variance on val over the conditions in factor. The conditions do not necessarily need to have equal numbers of samples.
- args:
val: dependent variable
factor: a dummy coded column label
- kwds:
- posthoc:
‘tukey’: conduct Tukey posthoc tests
‘SNK’: conduct Newman-Keuls posthoc tests
- where:
conditions to apply before running analysis
- return:
an
pyvttbl.stats.Anova1way
object
- DataFrame.anova(dv, sub='SUBJECT', wfactors=None, bfactors=None, measure='', transform='', alpha=0.05)¶
conducts a betweeen, within, or mixed, analysis of variance
- args:
dv: label containing dependent variable
- kwds:
wfactors: list of within variable factor labels
bfactors: list of between variable factor labels
sub: label coding subjects (or the isomorphism)
- measure: string to describe dv (outputs ‘<dv> of
<measure>’) intended when dv name is generic (e.g., MEAN, RMS, SD, …)
transform: string specifying a data transformation
STRING OPTION
TRANSFORM
COMMENTS
‘’
X
default
‘log’,’log10’
numpy.log(X)
base 10 transform
‘reciprocal’, ‘inverse’
1/X
‘square-root’, ‘sqrt’
numpy.sqrt(X)
‘arcsine’, ‘arcsin’
numpy.arcsin(X)
‘windsor 10’
windsor(X, 10)
10% windosr trim
- DataFrame.chisquare1way(observed, expected_dict=None, alpha=0.05, where=None)¶
conducts a one-way chi-square goodness-of-fit test on the data in observed
- args:
observed: column label containing categorical observations
- kwds:
- expected_dict: a dictionary object with keys matching the categories
in observed and values with the expected counts. The categories in the observed column must be a subset of the keys in the expected_dict. If expected_dict is None, the total N is assumed to be equally distributed across all groups.
alpha: the type-I error probability
- where:
conditions to apply before running analysis
- return:
an
pyvttbl.stats.ChiSquare1way
object
- DataFrame.chisquare2way(rfactor, cfactor, alpha=0.05, where=None)¶
conducts a two-way chi-square goodness-of-fit test on the data in observed
- args:
rfactor: column key
cfactor: column key
- kwds:
alpha: the type-I error probability
- where:
conditions to apply before running analysis
- return:
an
pyvttbl.stats.ChiSquare2way
object
- DataFrame.correlation(variables, coefficient='pearson', alpha=0.05, where=None)¶
produces a correlation matrix and conducts step-down significance testing on the column labels in variables.
- args:
variables: column keys to include in correlation matrix
- kwds:
- coefficient:
{ ‘pearson’, ‘spearman’, ‘kendalltau’, ‘pointbiserial’ }
alpha: the type-I error probability
- where:
conditions to apply before running analysis
- return:
an
pyvttbl.stats.Correlation
object
- DataFrame.ttest(aname, bname=None, pop_mean=0.0, paired=False, equal_variance=True, where=None)¶
produces a correlation matrix and conducts step-down significance testing on the column labels in variables.
- args:
aname: column key
- kwds:
- bname: is not specified a one-sample t-test is performed on
comparing the values in column aname with a hypothesized population mean.
- pop_mean: specifies the null population mean for one-sample t-test.
Ignored if bname is supplied
- paired:
True: a paired t-test is conducted
False: an independent samples t-test is conducted
- equal_variance:
True: assumes aname and bname have equal variance
False: assumes aname and bname have unequal variance
- where:
conditions to apply before running analysis
- return:
an
pyvttbl.stats.Ttest
object
Private Methods¶
These methods are for working interfacing DataFrame
with sqlite3.
- DataFrame._get_sqltype(key)¶
returns the sqlite3 type associated with the provided key
- args:
key: key in
DataFrame
(raises KeyError if key not in self)- returns:
- a string specifiying the sqlite3 type associated with the data in self[key]:
{ ‘null’, ‘integer’, ‘real’, ‘text’}
- DataFrame._get_nptype(key)¶
returns the numpy type object associated with the provided key
- args:
key: key in
DataFrame
(raises KeyError if key not in self)- returns:
a numpy object specifiying the type associated with the data in self[key]:
sql type
numpy type
‘null’
np.dtype(object)
‘integer’
np.dtype(int)
‘real’
np.dtype(float)
‘text’
np.dtype(str)
- DataFrame._get_mafillvalue(key)¶
returns the default fill value for invalid data associated with the provided key.
- args:
key: key in
DataFrame
(raises KeyError if key not in self)- returns:
string, float, or int associated with the data in self[key]
sql type
default
‘null’
‘?’
‘integer’
999999
‘real’
1e20
‘text’
‘N/A’
returned values match the defaults associated with np.ma.MaskedArray
- DataFrame._are_col_lengths_equal()¶
private method to check if the items in self have equal lengths
- args:
None
- returns:
returns True if all the items are equal
returns False otherwise
- DataFrame._determine_sqlite3_type(iterable)¶
determine the sqlite3 datatype of iterable
- args:
iterable: a 1-d iterable (list, tuple, np.array, etc.)
- returns:
sqlite3 type as string: ‘null’, ‘integer’, ‘real’, or ‘text’
- DataFrame._execute(query, t=None)¶
private method to execute sqlite3 query
When the PRINTQUERIES bool is true it prints the queries before executing them
- DataFrame._executemany(query, tlist)¶
private method to execute sqlite3 queries
When the PRINTQUERIES bool is true it prints the queries before executing them. The execute many method is about twice as fast for building tables as the execute method.
- DataFrame._get_indices_where(where)¶
determines the indices cooresponding to the conditions specified by the where argument.
- args:
where: a string criterion without the ‘where’
- returns:
a list of indices
- DataFrame._build_sqlite3_tbl(nsubset, where=None)¶
build or rebuild sqlite table with columns in nsubset based on the where list.
- args:
nsubset: a list of keys to include in the table
where: criterion the entries in the table must satisfy
- returns:
None
where can be a list of tuples. Each tuple should have three elements. The first should be a column key (label). The second should be an operator: in, =, !=, <, >. The third element should contain value for the operator.where can also be a list of strings. or a single string.sqlite3 table is built in memory and has the id TBL