DataFrame Overview

Description

DataFrame objects are containers for holding tabulated data. They inherent collections.OrderedDict and hold data in numpy.array objects.

Public Methods

Methods to get data into a DataFrame, manipulate and manage data, and write data. For the most part these remove the user from the fact they are using sqlite3 to manipulate data.


DataFrame.__init__(*args, **kwds)

initialize a DataFrame object.

Subclass of collections. OrderedDict.
Understands the typical initialization dict signatures.
Keys must be hashable.
Values become numpy.arrays or numpy.ma.MaskedArrays.

DataFrame.__setitem__(key, item, mask=None)

assign a column in the table

args:

key: hashable object to associate with item

item: an iterable that is put in an np.array or np.ma.array

kwds:

mask: mask value passed to np.ma.MaskedArray.__init__()

returns:

None

df.__setitem__(key, item) <==> df[key] = item
The assigned item must be iterable. To add a single row use the insert method. To another table to this one use the attach method.
example:
>>> ...
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female 
>>> import numpy as np
>>> df['log10(age)'] = np.log10(df['age'])
>>> print(df)
first      last       age   gender   log10(age) 
===============================================
Roger   Lew            28   male          1.447 
Bosco   Robinson        5   male          0.699 
Megan   Whittington    26   female        1.415 
John    Smith          51   male          1.708 
Jane    Doe            49   female        1.690 
>>> 

DataFrame.__delitem__(key)

delete a column from the table

args:

key: associated with the item to delete

returns:

None

df.__delitem__(key) <==> del df[key]
example:
>>> ...
>>> print(df)
first      last       age   gender   log10(age) 
===============================================
Roger   Lew            28   male          1.447 
Bosco   Robinson        5   male          0.699 
Megan   Whittington    26   female        1.415 
John    Smith          51   male          1.708 
Jane    Doe            49   female        1.690 
>>> del df['log10(age)']
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female 
>>> 

DataFrame.__str__()

returns human friendly string representation of object

args:

None

returns:

string with easy to read representation of table

df.__str__() <==> str(df)

DataFrame.read_tbl(fname, skip=0, delimiter=',', labels=True)

loads tabulated data from a plain text file

args:

fname: path and name of datafile

kwds:

skip: number of lines to skip before looking for column labels. (default = 0)

delimiter: string to seperate values (default = “’”)

labels: bool specifiying whether first row (after skip) contains labels. (default = True)

returns:

None

Checks and renames duplicate column labels as well as checking
for missing cells. readTbl will warn and skip over missing lines.

DataFrame.attach(other)

attaches a second DataFrame to self

args:

other: a DataFrame object whose key set matches self

return:

None


DataFrame.insert(row)

insert a row into the table

args:

row: should be mappable. e.g. a dict or a list with key/value pairs.

returns:

None

example:
>>> from pyvttbl import DataFrame
>>> from collections import namedtuple
>>> Person = namedtuple('Person',['first','last','age','gender'])
>>> df =DataFrame()
>>> df.insert(Person('Roger', 'Lew', 28, 'male')._asdict())
>>> df.insert(Person('Bosco', 'Robinson', 5, 'male')._asdict())
>>> df.insert(Person('Megan', 'Whittington', 26, 'female')._asdict())
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
>>> 

DataFrame.sort(order=None)

sort the table in-place

kwds:

order: is a list of factors to sort by to reverse order append ” desc” to the factor

returns:

None

example:
>>> from pyvttbl import DataFrame
>>> from collections import namedtuple
>>> Person = namedtuple('Person',['first','last','age','gender'])
>>> df =DataFrame()
>>> df.insert(Person('Roger', 'Lew', 28, 'male')._asdict())
>>> df.insert(Person('Bosco', 'Robinson', 5, 'male')._asdict())
>>> df.insert(Person('Megan', 'Whittington', 26, 'female')._asdict())
>>> df.insert(Person('John', 'Smith', 51, 'male')._asdict())
>>> df.insert(Person('Jane', 'Doe', 49, 'female')._asdict())
>>> df.sort(['gender', 'age'])
>>> print(df)
first      last       age   gender 
==================================
Megan   Whittington    26   female 
Jane    Doe            49   female 
Bosco   Robinson        5   male   
Roger   Lew            28   male   
John    Smith          51   male   
>>> 

DataFrame.select_col(key, where=None)

determines rows in table that satisfy the conditions given by where and returns the values of key in the remaining rows

args:

key: column label of data to return

kwds:

where: constraints to apply to table before returning data

returns:

a list

example:
>>> ...
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female 
>>> df.select_col('age', where='gender == "male"')
[28, 5, 51]
>>> 

DataFrame.row_iter()

iterate over the rows in table

args:

None

returns:

iterator that yields OrderedDict objects with (key,value) pairs cooresponding to the data in each row

example:
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   male   
John    Smith          51   female 
Jane    Doe            49   female 
>>> for case in df.row_iter():
        print(case)
OrderedDict([('first', 'Roger'), ('last', 'Lew'), ('age', 28), ('gender', 'male')])
OrderedDict([('first', 'Bosco'), ('last', 'Robinson'), ('age', 5), ('gender', 'male')])
OrderedDict([('first', 'Megan'), ('last', 'Whittington'), ('age', 26), ('gender', 'male')])
OrderedDict([('first', 'John'), ('last', 'Smith'), ('age', 51), ('gender', 'female')])
OrderedDict([('first', 'Jane'), ('last', 'Doe'), ('age', 49), ('gender', 'female')])
>>> 

DataFrame.pivot(val, rows=None, cols=None, aggregate='avg', where=None, attach_rlabels=False, method='valid')

produces a contingency table according to the arguments and keywords provided.

args:

val: the colname to place as the data in the table

kwds:
rows: list of colnames whos combinations will become rows

in the table if left blank their will be one row

cols: list of colnames whos combinations will become cols

in the table if left blank their will be one col

aggregate: function applied across data going into each cell

of the table <http://www.sqlite.org/lang_aggfunc.html>_

where: list of tuples or list of strings for filtering data

method:

‘valid’: only returns rows or columns with valid entries.

‘full’: return full factorial combinations of the

conditions specified by rows and cols

returns:

PyvtTbl object


DataFrame.where(where)

Applies the where filter to a copy of the DataFrame, and returns the new DataFrame. The associated DataFrame is not copied.

args:

where: criterion to apply to new table

returns:

a new DataFrame

example:
>>> ...
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female
>>> print(df.where('age > 20 and age < 45'))
first      last       age   gender 
==================================
Roger   Lew            28   male   
Megan   Whittington    26   female 
>>> 

DataFrame.where_update(where)

Applies the where filter in-place.

args:

where: criterion to apply to table

returns:

None


DataFrame.summary(where=None)

prints the descriptive information for each column in DataFrame

kwds:

where: criterion to apply to table before running analysis

returns:

None


DataFrame.validate(criteria, verbose=False, report=False)

validate the data in the table.

args:

criteria: a dict whose keys should coorespond to columns in the table. The values should be functions which take a single parameter and return a boolean.

kwds:
verbose:

True: provide real-time feedback

False: don’t provide feedback (default)

report:

True: print a report upon completion

False: don’t print report (default)

returns:

True: the criteria was satisfied

False: the critera was not satisfied

example:
>>> ...
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female
>>> def isint(x):
        try : return int(x)-float(x)==0
        except:  return False
>>> df.validate({'age' : lambda x: isint(x),
                 'gender' : lambda x: x in ['male', 'female']},
                 verbose=True, report=True)                    
Validating gender:
.....
Validating age:
.....
Report:
  Values tested: 10 
  Values passed: 10 
  Values failed: 0
***Validation PASSED***
True
>>>

DataFrame.types()
returns a list of the sqlite3 datatypes of the columns
args:

None

returns:

an ordered list of sqlite3 types.

order matches self.keys()

DataFrame.shape()

returns the size of the data in the table as a tuple

args:

None

returns:

tuple (number of columns, number of rows)


DataFrame.bind_aggregate(name, arity, func)

binds a sqlite3 aggregator to DataFrame

args:

name: string to be associated with the aggregator

arity: the number of inputs required by the aggregator

func: the aggregator class

returns:

None

DataFrame.aggregates is a list of the available aggregators.
For information on rolling your own aggregators see: http://docs.python.org/library/sqlite3.html

pyvttbl.plotting Wrappers

Methods to visualize data.


DataFrame.histogram_plot(val, **kwargs)

Makes a histogram plot

args:

key: column label of dependent variable

kwds:

where: criterion to apply to table before running analysis

bins: number of bins (default = 10)

range: list of length 2 defining min and max bin edges


DataFrame.scatter_plot(aname, bname, **kwargs)

Creates a scatter plot with the specified parameters

args:

aname: variable on x-axis

bname: variable on y-axis

kwds:
alpha:

amount of transparency applied

trend :

None: no model fitting

‘linear’: f(x) = a + b*x

‘exponential’: f(x) = a * x**b

‘logarithmic’: f(x) = a * log(x) + b

‘polynomial’: f(x) = a * x**2 + b*x + c

‘power’: f(x) = a * x**b


DataFrame.box_plot(val, factors=None, **kwargs)

Makes a box plot

args:
df:

a pyvttbl.DataFrame object

val:

the label of the dependent variable

kwds:
factors:

a list of factors to include in boxplot

where:

a string, list of strings, or list of tuples applied to the DataFrame before plotting

fname:

output file name

quality:

{‘low’ | ‘medium’ | ‘high’} specifies image file dpi


DataFrame.interaction_plot(val, xaxis, **kwargs)

makes an interaction plot

args:
df:

a pyvttbl.DataFrame object

val:

the label of the dependent variable

xaxis:

the label of the variable to place on the xaxis of each subplot

kwds:
seplines:

label specifying seperate lines in each subplot

sepxplots:

label specifying seperate horizontal subplots

sepyplots:

label specifying separate vertical subplots

xmin:

(‘AUTO’ by default) minimum xaxis value across subplots

xmax:

(‘AUTO’ by default) maximum xaxis value across subplots

ymin:

(‘AUTO’ by default) minimum yaxis value across subplots

ymax:

(‘AUTO’ by default) maximum yaxis value across subplots

where:

a string, list of strings, or list of tuples applied to the DataFrame before plotting

fname:

output file name

quality:

{‘low’ | ‘medium’ | ‘high’} specifies image file dpi

yerr:

{float, ‘ci’, ‘stdev’, ‘sem’} designates errorbars across datapoints in all subplots


DataFrame.scatter_matrix(variables, **kwargs)

Plots a matrix of scatterplots

args:
variables:

column labels to include in scatter matrix

kwds:
alpha:

amount of transparency applied

grid:

setting this to True will show the grid

diagonal:

‘kde’: Kernel Density Estimation

‘hist’: 20 bin Histogram

None: just labels

trend :

None: no model fitting

‘linear’: f(x) = a + b*x (default)

‘exponential’: f(x) = a * x**b

‘logarithmic’: f(x) = a * log(x) + b

‘polynomial’: f(x) = a * x**2 + b*x + c

‘power’: f(x) = a * x**b

alternate_labels: Specifies whether the labels and ticks should

alternate. Default is True. When False tick labels will be on the left and botttom, and variable labels will be on the top and right.


pyvttbl.stats Wrappers

Methods to conduct descriptive and inferential statistical analyses.


DataFrame.descriptives(key, where=None)

Conducts a descriptive statistical analysis of the data in self[key].

args:

key: column label

kwds:

where: criterion to apply to table before running analysis

returns:

a pyvttbl.stats. Descriptives object


DataFrame.histogram(key, where=None, bins=10, range=None, density=False, cumulative=False)

Conducts a histogram analysis of the data in self[key].

args:

key: column label of dependent variable

kwds:

where: criterion to apply to table before running analysis

bins: number of bins (default = 10)

range: list of length 2 defining min and max bin edges

returns:

a pyvttbl.stats. Descriptives object


DataFrame.marginals(key, factors, where=None)

DataFrame.anova1way(val, factor, posthoc='tukey', where=None)

Conducts a one-way analysis of variance on val over the conditions in factor. The conditions do not necessarily need to have equal numbers of samples.

args:

val: dependent variable

factor: a dummy coded column label

kwds:
posthoc:

‘tukey’: conduct Tukey posthoc tests

‘SNK’: conduct Newman-Keuls posthoc tests

where:

conditions to apply before running analysis

return:

an pyvttbl.stats.Anova1way object


DataFrame.anova(dv, sub='SUBJECT', wfactors=None, bfactors=None, measure='', transform='', alpha=0.05)

conducts a betweeen, within, or mixed, analysis of variance

args:

dv: label containing dependent variable

kwds:

wfactors: list of within variable factor labels

bfactors: list of between variable factor labels

sub: label coding subjects (or the isomorphism)

measure: string to describe dv (outputs ‘<dv> of

<measure>’) intended when dv name is generic (e.g., MEAN, RMS, SD, …)

transform: string specifying a data transformation

STRING OPTION

TRANSFORM

COMMENTS

‘’

X

default

‘log’,’log10’

numpy.log(X)

base 10 transform

‘reciprocal’, ‘inverse’

1/X

‘square-root’, ‘sqrt’

numpy.sqrt(X)

‘arcsine’, ‘arcsin’

numpy.arcsin(X)

‘windsor 10’

windsor(X, 10)

10% windosr trim


DataFrame.chisquare1way(observed, expected_dict=None, alpha=0.05, where=None)

conducts a one-way chi-square goodness-of-fit test on the data in observed

args:

observed: column label containing categorical observations

kwds:
expected_dict: a dictionary object with keys matching the categories

in observed and values with the expected counts. The categories in the observed column must be a subset of the keys in the expected_dict. If expected_dict is None, the total N is assumed to be equally distributed across all groups.

alpha: the type-I error probability

where:

conditions to apply before running analysis

return:

an pyvttbl.stats.ChiSquare1way object


DataFrame.chisquare2way(rfactor, cfactor, alpha=0.05, where=None)

conducts a two-way chi-square goodness-of-fit test on the data in observed

args:

rfactor: column key

cfactor: column key

kwds:

alpha: the type-I error probability

where:

conditions to apply before running analysis

return:

an pyvttbl.stats.ChiSquare2way object


DataFrame.correlation(variables, coefficient='pearson', alpha=0.05, where=None)

produces a correlation matrix and conducts step-down significance testing on the column labels in variables.

args:

variables: column keys to include in correlation matrix

kwds:
coefficient:

{ ‘pearson’, ‘spearman’, ‘kendalltau’, ‘pointbiserial’ }

alpha: the type-I error probability

where:

conditions to apply before running analysis

return:

an pyvttbl.stats.Correlation object


DataFrame.ttest(aname, bname=None, pop_mean=0.0, paired=False, equal_variance=True, where=None)

produces a correlation matrix and conducts step-down significance testing on the column labels in variables.

args:

aname: column key

kwds:
bname: is not specified a one-sample t-test is performed on

comparing the values in column aname with a hypothesized population mean.

pop_mean: specifies the null population mean for one-sample t-test.

Ignored if bname is supplied

paired:

True: a paired t-test is conducted

False: an independent samples t-test is conducted

equal_variance:

True: assumes aname and bname have equal variance

False: assumes aname and bname have unequal variance

where:

conditions to apply before running analysis

return:

an pyvttbl.stats.Ttest object


Private Methods

These methods are for working interfacing DataFrame with sqlite3.


DataFrame._get_sqltype(key)

returns the sqlite3 type associated with the provided key

args:

key: key in DataFrame (raises KeyError if key not in self)

returns:
a string specifiying the sqlite3 type associated with the data in self[key]:

{ ‘null’, ‘integer’, ‘real’, ‘text’}


DataFrame._get_nptype(key)

returns the numpy type object associated with the provided key

args:

key: key in DataFrame (raises KeyError if key not in self)

returns:

a numpy object specifiying the type associated with the data in self[key]:

sql type

numpy type

‘null’

np.dtype(object)

‘integer’

np.dtype(int)

‘real’

np.dtype(float)

‘text’

np.dtype(str)


DataFrame._get_mafillvalue(key)

returns the default fill value for invalid data associated with the provided key.

args:

key: key in DataFrame (raises KeyError if key not in self)

returns:

string, float, or int associated with the data in self[key]

sql type

default

‘null’

‘?’

‘integer’

999999

‘real’

1e20

‘text’

‘N/A’

returned values match the defaults associated with np.ma.MaskedArray

DataFrame._are_col_lengths_equal()

private method to check if the items in self have equal lengths

args:

None

returns:

returns True if all the items are equal

returns False otherwise


DataFrame._determine_sqlite3_type(iterable)

determine the sqlite3 datatype of iterable

args:

iterable: a 1-d iterable (list, tuple, np.array, etc.)

returns:

sqlite3 type as string: ‘null’, ‘integer’, ‘real’, or ‘text’


DataFrame._execute(query, t=None)

private method to execute sqlite3 query

When the PRINTQUERIES bool is true it prints the queries before executing them

DataFrame._executemany(query, tlist)

private method to execute sqlite3 queries

When the PRINTQUERIES bool is true it prints the queries before executing them. The execute many method is about twice as fast for building tables as the execute method.

DataFrame._get_indices_where(where)

determines the indices cooresponding to the conditions specified by the where argument.

args:

where: a string criterion without the ‘where’

returns:

a list of indices


DataFrame._build_sqlite3_tbl(nsubset, where=None)

build or rebuild sqlite table with columns in nsubset based on the where list.

args:

nsubset: a list of keys to include in the table

where: criterion the entries in the table must satisfy

returns:

None

where can be a list of tuples. Each tuple should have three elements. The first should be a column key (label). The second should be an operator: in, =, !=, <, >. The third element should contain value for the operator.
where can also be a list of strings. or a single string.
sqlite3 table is built in memory and has the id TBL