Working with SNP Arrays

The SNP Array class

SNP_array(zipfile, fileformat = 'one column', delim = ',', samp_col = None, encoding = None, header_lines = 0, startatline = 0, readnrows = None)

parameters:

zipfile - The file containing the data. It is first read as a gzip-compressed file; if that fails, it is read as plain text.

fileformat - Either 'one column' or 'two column', depending on whether each individual's alleles are shown in one or two columns. Defaults to 'one column'.

delim - The delimiter that separates the values in your data. Defaults to ',', but other common options are tabs ('\t') or spaces (' ').

samp_col - The first column in which the sample data starts. If no value is supplied, it defaults to 3 for one column data and 4 for two column data.

encoding - This argument takes either an integer or a pandas series. If an integer is supplied, the encoder is taken from the corresponding column (0-based). Alternatively, a pandas series containing the reference/alternate alleles (e.g. 'A/G') for each position, with ids that correspond to the file, can be supplied. If no encoder is supplied, no genotype matrix is created.

header_lines - The number of lines to be read as the header.

startatline - The line at which to start reading the data.

readnrows - How many lines of data to read. When None, the file is read until the end.
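
The gzip-then-plain-text fallback described for zipfile can be sketched as follows. This is only an illustration of the idea using the standard library, not the class's actual implementation; read_rows is a hypothetical helper name, and it returns plain lists of strings rather than a pandas dataframe.

```python
import csv
import gzip


def read_rows(path, delim=","):
    """Hypothetical helper: return the rows of a delimited file, gzipped or not."""
    try:
        # First attempt: treat the file as gzip-compressed text.
        with gzip.open(path, "rt", newline="") as handle:
            return list(csv.reader(handle, delimiter=delim))
    except OSError:
        # gzip raises an OSError subclass when the file is not gzipped,
        # so fall back to reading it as plain text.
        with open(path, "r", newline="") as handle:
            return list(csv.reader(handle, delimiter=delim))
```

Calling read_rows('calls.txt', delim='\t') on a hypothetical tab-separated file would return the same rows whether or not the file had been gzipped.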

Data

SNP_array.df
Data that was passed in parsed as a pandas dataframe.
SNP_array.encoder
The encoder passed in on creation of the object.
SNP_array.geno
A pandas dataframe of the genotype data. 0 represents homozygous for the reference allele, 1 is heterozygous, and 2 is homozygous for the alternate allele.
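
To make the 0/1/2 coding concrete, here is a minimal sketch of how a single call could be encoded. It assumes one column data in which each call is a two-letter string such as 'AG' and each allele is a single character; encode_call is a hypothetical function, and returning None for unexpected calls is an assumption, not documented behavior of the class.

```python
def encode_call(call, ref_alt):
    """Count alternate alleles in a two-letter genotype call such as 'AG'.

    ref_alt is a 'REF/ALT' string such as 'A/G'. Returns 0, 1, or 2, or
    None when the call is not made up of exactly those two alleles.
    """
    ref, alt = ref_alt.split("/")
    alleles = list(call)
    if len(alleles) != 2 or any(a not in (ref, alt) for a in alleles):
        return None  # assumption: treat unexpected calls as missing
    # The genotype value is the number of alternate alleles in the call.
    return sum(a == alt for a in alleles)


calls = ["AA", "AG", "GA", "GG", "--"]
print([encode_call(c, "A/G") for c in calls])  # [0, 1, 1, 2, None]
```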

SNP_array.apply_encoder(encoder)

If no encoder was supplied when the object was created, call this method to build the genotype matrix afterwards.

parameters:
encoder - An integer that corresponds to the appropriate column, or a pandas series containing the reference/alternate alleles (e.g. 'A/G') for each position, with ids that correspond to the file.
output:
A pandas dataframe of genotype data. 0 represents homozygous for the reference allele, 1 is heterozygous, and 2 is homozygous for the alternate allele.
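
The two accepted forms of encoder (a column number or a pandas series) can be thought of as two routes to the same id-to-alleles mapping. The sketch below illustrates that idea with pandas only; resolve_encoder is a hypothetical helper, and placing the marker ids in column 0 is an assumption made for the example, not something the class guarantees.

```python
import pandas as pd


def resolve_encoder(df, encoder, id_col=0):
    """Hypothetical helper: return 'REF/ALT' strings indexed by marker id."""
    ids = df.iloc[:, id_col].tolist()
    if isinstance(encoder, int):
        # Integer: take the 0-based column of df and index it by marker id.
        return pd.Series(df.iloc[:, encoder].tolist(), index=ids)
    if isinstance(encoder, pd.Series):
        # Series: align it to the ids in the file, in file order.
        return encoder.reindex(ids)
    raise TypeError("encoder must be an int or a pandas Series")


df = pd.DataFrame({
    "id": ["snp1", "snp2"],
    "alleles": ["A/G", "C/T"],
    "sample1": ["AG", "CC"],
})
by_column = resolve_encoder(df, 1)
by_series = resolve_encoder(df, pd.Series({"snp2": "C/T", "snp1": "A/G"}))
print(by_column.tolist() == by_series.tolist())
```

In this example both routes give the same alleles for snp1 and snp2, even though the series was supplied in a different order from the file.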