phylovelo

Submodules

Classes

`GeneExpr`	Gene expression program
`Cell`	Cell class
`Reaction`	Cell division/differentiation type
`Gillespie`	Gillespie simulation
`Gene`	Gene class
`scData`	Data structure for PhyloVelo analysis

Functions

`get_annotation`(file)	Get simulation data annotation
`sim_base_expr`(tree, cell_states, Ngene, ...[, ...])	Simulate base expression
`add_lineage_noise`(tree, base_expr_mat[, scale])	Simulate lineage noise
`get_count_from_base_expr`(base_expr_mat[, alpha])	Draw gene expression count from base expression matrix
`get_count`(paras)	Draw a random sample from the NB distribution with paras = (r, p)
`reconstruct`(file[, output, seed, is_balance])	Reconstruct phylogenetic tree of simulation data
`wirte_lineage_info`(filepath, anc_cells, curr_cells, ...)	Record lineage information in simulation
`loadtree`(file)	Reformat tree file from simulation data
`logNormalize`(data[, scaling])	Log normalize data
`plot_tree`(tree, colors, ax[, colortab, stain])	Draw phylogenetic tree
`get_weight`(x, distance, scale, length)	Weighted sum of velocities onto the grid
`generate_grid`([xlim, ylim, density])	Generate grid to project velocities.
`velocity_embedding_to_grid`(pts, vel[, nn, ...])	Project velocities to grid
`velocity_plot`(pts, vel, ax[, figtype, nn, ...])	Project velocities into embedding
`mullerplot`(data, label, color[, absolute, alpha, ax])	Draw mullerplot
`label_name`(loc, cell_types, ax[, fontsize, font])	Label cell type names on figures.
`corr_plot`(x, y, ax[, stats, r0_x, r0_y, r1_x, r1_y, ...])	Draw a scatter plot of the two sets of data and show their correlation coefficients
`velocity_inference`(sd[, time, cutoff, alpha, target, ...])	Infer phylogenetic velocity
`velocity_embedding`(sd[, target, n_neigh, chunk_size])	Project velocity into embedding
`calc_meg_pseudotime`(sd[, target, genes, ...])	Calculate pseudotime directly from robustly oriented MEG expression.
`calc_phylo_pseudotime`(sd[, n_neighbors, r_sample, ...])	Calculate the phyloVelo pseudotime

Package Contents

class GeneExpr(Ngene: int, r_variant_gene: float, diff_map: dict, state_time: dict, forward_map: dict = None)

Gene expression program

Args:

Ngene:: Number of genes
r_variant_gene:: Fraction of genes whose expression changes with differentiation
diff_map:: Differentiation relationships between cell types; {a: [b, c]} means ‘a’ differentiates from ‘b’ and ‘c’
state_time:: Pseudotime of each state
forward_map:: Used only in convergent-model simulation; {a: b} means ‘a’ will differentiate into ‘b’

Ngene

Nstates

variant_gene

expr_rec

cells_genetype

diff_map

state_time

forward_map = None

generate_genes(mu0_loc: float = 20, mu0_scale: float = 3, drift_loc: float = 0, drift_scale: float = 1)

Generate genes

Args:

mu0_loc:: Mean of initial expression
mu0_scale:: Variation of initial expression
drift_loc:: Mean of gene drift
drift_scale:: Variation of drift
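
As a rough illustration of how these four parameters might interact, the sketch below draws an initial expression level and a drift for every gene. The normal distribution, the `SketchGene` tuple, and the `sketch_generate_genes` helper are assumptions made for illustration; they are not taken from PhyloVelo's implementation.

```python
import random
from typing import List, NamedTuple


class SketchGene(NamedTuple):
    """Hypothetical stand-in for a gene: initial expression and drift."""
    mu0: float
    drift: float


def sketch_generate_genes(n_gene: int, mu0_loc: float = 20, mu0_scale: float = 3,
                          drift_loc: float = 0, drift_scale: float = 1,
                          seed: int = 0) -> List[SketchGene]:
    # Assumption: both quantities are normally distributed (loc = mean, scale = std).
    rng = random.Random(seed)
    return [SketchGene(mu0=rng.gauss(mu0_loc, mu0_scale),
                       drift=rng.gauss(drift_loc, drift_scale))
            for _ in range(n_gene)]


genes = sketch_generate_genes(2000)
mean_mu0 = sum(g.mu0 for g in genes) / len(genes)      # close to mu0_loc
mean_drift = sum(g.drift for g in genes) / len(genes)  # close to drift_loc
```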

expr(state, time)

get_annotation(file)

Get simulation data annotation

Args:

file:: Tree file from simulation script

Return:

list:: Cell names
list:: Cell states
list:: Cell generations

sim_base_expr(tree: bio.phylo.tree, cell_states: pandas.DataFrame, Ngene: int, r_variant_gene: float, diff_map: dict, forward_map: dict = {}, mu0_loc=20, mu0_scale=3, drift_loc=0, drift_scale=1, pseudo_state_time: dict = None)

Simulate base expression

Args:

tree:: Phylogenetic tree
cell_states:: DataFrame of cell types, indexed by cell name
Ngene:: Number of genes
r_variant_gene:: Fraction of genes whose expression changes with differentiation
diff_map:: Differentiation relationships between cell types; {a: [b, c]} means ‘a’ differentiates from ‘b’ and ‘c’
pseudo_state_time:: Pseudotime of each state
forward_map:: Used only in convergent-model simulation; {a: b} means ‘a’ will differentiate into ‘b’
mu0_loc:: Mean of initial expression
mu0_scale:: Variation of initial expression
drift_loc:: Mean of gene drift
drift_scale:: Variation of drift

Returns:

class:: Gene expression program
pd.DataFrame:: Base expression matrix

add_lineage_noise(tree: bio.phylo.tree, base_expr_mat: pandas.DataFrame, scale=0.0001)

Simulate lineage noise

Args:

tree:: Phylogenetic tree
base_expr_mat:: Base expression matrix from sim_base_expr
scale:: Lineage noise scale

Return:

pd.DataFrame:: Base expression matrix with lineage noise

get_count_from_base_expr(base_expr_mat: pandas.DataFrame, alpha: int = 3)

Draw gene expression count from base expression matrix

Args:

base_expr_mat:: Base expression matrix
alpha:: Scale parameter of NB distribution

Return:

Gene count matrix

get_count(paras: list)

Draw a random sample from the NB distribution with paras = (r, p)

Args:

paras:: NB parameters, [(r,p)]

Return:

int:: Random sample
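
For readers unfamiliar with the (r, p) parameterization, the following sketch samples a negative binomial variate by counting failures before the r-th success. It assumes the same convention as `numpy.random.negative_binomial` (p is the success probability) and, for simplicity, an integer r; whether `get_count` follows this exact convention is not stated here. `sketch_nb_sample` is a hypothetical helper.

```python
import random


def sketch_nb_sample(r: int, p: float, rng: random.Random) -> int:
    """Count failures before the r-th success, with success probability p."""
    failures = 0
    successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


rng = random.Random(42)
samples = [sketch_nb_sample(5, 0.5, rng) for _ in range(20000)]
# Theoretical mean under this convention: r * (1 - p) / p = 5
sample_mean = sum(samples) / len(samples)
```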

reconstruct(file: str, output: str = None, seed: int = None, is_balance: bool = False, **kwargs)

Reconstruct phylogenetic tree of simulation data

Args:

file:: Simulation file path
output:: Output newick file path
seed:: Random seed
is_balance:: Whether all cell types have an equal number of cells
ratio:: How many cells to reconstruct

Return:

newick tree at output file

wirte_lineage_info(filepath, anc_cells, curr_cells, curr_time): Record lineage information in simulation

class Cell(Ngene: int = None, state: int = 0, gen: int = None, cid: int = None, parent: int = None, tb: float = None, td: float = None)

Cell class

Args:

Ngene:: Gene number
state:: Cell type
gen:: Cell generation
cid:: Cell id
parent:: Cell’s parent
tb:: Birth time
td:: Death time

state = 0

parent = None

cid = None

gen = None

tb = None

td = None

class Reaction(rate: callable = None, num_lefts: list = None, num_rights: list = None, index: int = None)

Cell division/differentiation type

Args:

rate:: Reaction rate function
num_lefts:: Cell numbers before reaction
num_rights:: Cell numbers after reaction
index:: Reaction index

rate = None

num_lefts

num_rights

num_diff

index = None

combine(n, s)

propensity(n, t)
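
The two methods above are undocumented. In a standard stochastic simulation, a reaction's propensity is its rate multiplied by the number of ways to choose the reactants from the current population. The sketch below shows that textbook form, assuming `rate` is a function of time and that `combine` plays the role of a binomial coefficient; both are assumptions, and `sketch_propensity` is a hypothetical helper rather than the class's method.

```python
from math import comb
from typing import Callable, Sequence


def sketch_propensity(rate: Callable[[float], float], num_lefts: Sequence[int],
                      n: Sequence[int], t: float) -> float:
    # Number of distinct ways to pick the reactant cells from the current population.
    ways = 1
    for available, needed in zip(n, num_lefts):
        ways *= comb(available, needed)  # comb returns 0 when needed > available
    return rate(t) * ways


# Division of a type-0 cell with 30 type-0 and 5 type-1 cells present.
division_propensity = sketch_propensity(lambda t: 0.1, [1, 0], [30, 5], t=0.0)
# A reaction that needs a type-1 cell cannot fire when none exist.
blocked_propensity = sketch_propensity(lambda t: 0.1, [0, 1], [30, 0], t=0.0)
```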

class Gillespie(num_elements: int, inits: list = None, max_cell_num: int = 20000)

Gillespie simulation

Args:

num_elements:: Cell type number
inits:: Initial cell number
max_cell_num:: Maximum cell number

num_elements

reactions = []

anc_cells = []

curr_cells

generation_time = [0]

max_cell_num = 20000

add_reaction(rate: callable = None, num_lefts: list = None, num_rights: list = None, index: int = None)

Add reactions to simulation

Args:

rate:: Reaction rate function
num_lefts:: Cell numbers before reaction
num_rights:: Cell numbers after reaction
index:: Reaction index

evolute(steps: int)

Run simulation

Args:

steps:: Number of steps to run before stopping
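
To make the roles of `inits`, `max_cell_num`, reactions, and `steps` concrete, here is a generic Gillespie loop for cell populations. It is a minimal sketch of the textbook algorithm with constant rates, not this class's implementation; `sketch_gillespie` and the tuple layout of `reactions` are hypothetical.

```python
import math
import random


def sketch_gillespie(inits, reactions, steps, max_cell_num=20000, seed=0):
    """Generic Gillespie loop.

    reactions: list of (rate, num_lefts, num_rights) with a constant rate.
    Returns the list of (time, counts) states visited.
    """
    rng = random.Random(seed)
    counts, t = list(inits), 0.0
    history = [(t, tuple(counts))]
    for _ in range(steps):
        # Propensity of each reaction in the current state.
        props = [rate * math.prod(math.comb(c, l) for c, l in zip(counts, lefts))
                 for rate, lefts, _ in reactions]
        total = sum(props)
        if total == 0 or sum(counts) >= max_cell_num:
            break
        # Waiting time to the next event is exponential with rate `total`.
        t += rng.expovariate(total)
        # Pick which reaction fires, proportionally to its propensity.
        _, lefts, rights = rng.choices(reactions, weights=props)[0]
        counts = [c - l + r for c, l, r in zip(counts, lefts, rights)]
        history.append((t, tuple(counts)))
    return history


# Two cell types: type 0 either self-renews or differentiates into type 1.
reactions = [
    (1.0, [1, 0], [2, 0]),  # 0 -> 0 + 0
    (0.5, [1, 0], [0, 1]),  # 0 -> 1
]
history = sketch_gillespie(inits=[10, 0], reactions=reactions, steps=200)
```

Each entry of `history` pairs a simulation time with the cell counts per type, which is roughly the shape of data a Muller plot needs.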

class Gene(mu0: float, drift: float, sigma: float = None, t0: int = 0)

Gene class

Args:

mu0:: Initial expression
drift:: Drift coefficient of DP
sigma:: Diffusion coefficient of DP
t0:: Gene initial time

mu0

drift

t0 = 0

base_expr

diffusion(): Diffusion one step.

base_expr_calc(t: int)

Calculate base expression

Args:

t:: time

Return:

Base expression at time t
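
One plausible reading of `mu0`, `drift`, `sigma`, and `t0` is a drift-diffusion process in which the base expression moves linearly with time and each diffusion step adds Gaussian noise. The sketch below encodes that reading; the linear form, the unit-time noise step, and the `SketchDriftGene` class are assumptions, not the documented behavior of `Gene`.

```python
import random


class SketchDriftGene:
    """Hypothetical drift-diffusion gene; not PhyloVelo's Gene class."""

    def __init__(self, mu0: float, drift: float, sigma: float = 0.0, t0: int = 0,
                 seed: int = 0):
        self.mu0, self.drift, self.sigma, self.t0 = mu0, drift, sigma, t0
        self.noise = 0.0  # accumulated diffusion term
        self._rng = random.Random(seed)

    def diffusion(self) -> None:
        # One unit-time diffusion step: add Gaussian noise with std sigma.
        self.noise += self._rng.gauss(0.0, self.sigma)

    def base_expr_calc(self, t: int) -> float:
        # Assumed linear drift measured from the gene's start time t0.
        return self.mu0 + self.drift * (t - self.t0)


gene = SketchDriftGene(mu0=20.0, drift=0.5, sigma=0.1, t0=2)
expr_at_t0 = gene.base_expr_calc(2)    # no drift has accumulated yet
expr_at_t10 = gene.base_expr_calc(10)  # 8 time units of drift at 0.5 per unit
```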

loadtree(file)

Reformat tree file from simulation data

Args:

file(str):: File path generated by simulation code

Returns:

Bio.Phylo.Tree:: biopython’s phylo tree
list[str]:: cell types of leaf nodes

logNormalize(data, scaling=1)

Log normalize data

Args:

data(pandas.DataFrame, numpy.array):: expression data
scaling(int):: Normalization scale

Return:

normalized data

plot_tree(tree, colors, ax: matplotlib.axes, colortab: list = ['gray', 'blue', 'green', 'orange', 'purple'], stain: str = 'all')

Draw phylogenetic tree

Args:

tree:: Tree returned by loadtree
colors:: Cell types of the leaf nodes, as returned by loadtree
ax:: matplotlib axes to draw on
colortab:: List of colors used to paint the different cell types
stain:: ‘all’ to color all branches, ‘terminals’ to color only terminal branches

Return:

matplotlib.axes

get_weight(x: list, distance: list, scale, length: int)

Weighted sum of velocities onto the grid

Args:

x:: neighbors
distance:: List of distance to neighbors
scale:: Scale factor
length:: Number of neighbors

Return:

Weighted velocities
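
The weighting scheme itself is not described above. A common choice when projecting velocities onto a grid is a Gaussian kernel on the neighbor distances, sketched below. The kernel, the normalization to a weighted average, and the `sketch_weighted_velocity` helper are assumptions and may differ from what `get_weight` computes.

```python
import math
from typing import List, Sequence


def sketch_weighted_velocity(neighbor_vels: Sequence[Sequence[float]],
                             distances: Sequence[float], scale: float) -> List[float]:
    # Assumed Gaussian kernel: closer neighbors contribute more.
    weights = [math.exp(-0.5 * (d / scale) ** 2) for d in distances]
    total = sum(weights)
    dim = len(neighbor_vels[0])
    return [sum(w * v[k] for w, v in zip(weights, neighbor_vels)) / total
            for k in range(dim)]


# Two equidistant neighbors: the grid point gets their average velocity.
avg = sketch_weighted_velocity([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5], scale=1.0)
# A much closer neighbor dominates the result.
near = sketch_weighted_velocity([[1.0, 0.0], [0.0, 1.0]], [0.1, 3.0], scale=1.0)
```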

generate_grid(xlim=(-1, 1), ylim=(-1, 1), density: int = 20)

Generate grid to project velocities.

Args:

xlim:: Grid bound on x axis
ylim:: Grid bound on y axis
density:: Grid resolution (number of divisions per axis)

Return:

grid_X, grid_Y, grid_XY
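
The three return values resemble a mesh grid plus a flat list of grid points. The sketch below builds such a structure in plain Python, treating `density` as the number of grid points per axis; that interpretation, the row-by-row layout, and the `sketch_generate_grid` helper are assumptions.

```python
def sketch_generate_grid(xlim=(-1, 1), ylim=(-1, 1), density: int = 20):
    # Evenly spaced coordinates along one axis, endpoints included.
    def linspace(lo, hi, n):
        step = (hi - lo) / (n - 1)
        return [lo + i * step for i in range(n)]

    xs, ys = linspace(*xlim, density), linspace(*ylim, density)
    grid_x = [list(xs) for _ in ys]             # x coordinate of every node, row by row
    grid_y = [[y] * len(xs) for y in ys]        # y coordinate of every node, row by row
    grid_xy = [(x, y) for y in ys for x in xs]  # flat list of (x, y) points
    return grid_x, grid_y, grid_xy


grid_x, grid_y, grid_xy = sketch_generate_grid(xlim=(0, 1), ylim=(0, 2), density=5)
```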

velocity_embedding_to_grid(pts: numpy.array, vel: numpy.array, nn: str = 'radius', grid_density: int = 20, n_neighbors: int = 4, radius: float = 2, xlim=(None, None), ylim=(None, None))

Project velocities to grid

Args:

pts:: UMAP/tSNE coordinates
vel:: Velocity vector
nn:: knn or radius neighbors to use
grid_density:: density of the grid
n_neighbors:: Number of neighbors; used when nn == ‘knn’
radius:: Neighborhood radius; used when nn == ‘radius’
xlim:: Grid bound on x axis
ylim:: Grid bound on y axis

Return:

velocity_plot(pts, vel, ax, figtype: str = 'grid', nn: str = 'radius', grid_density: int = 20, n_neighbors: int = 4, radius: float = 2, streamdensity: float = 1.5, xlim=(None, None), ylim=(None, None), **kwargs)

Project velocities into embedding

Args:

pts:: UMAP/tSNE coordinates
vel:: Velocity vector
ax:: matplotlib.axes
figtype:: ‘stream’, ‘grid’ or ‘point’(single cell)
nn:: knn or radius neighbors to use
grid_density:: density of the grid
n_neighbors:: Number of neighbors; used when nn == ‘knn’
radius:: Neighborhood radius; used when nn == ‘radius’
streamdensity:: Density of the streamplot; used when figtype == ‘stream’
xlim:: Grid bound on x axis
ylim:: Grid bound on y axis

Return:

matplotlib.axes

mullerplot(data: numpy.ndarray, label: list, color: list, absolute: bool = 0, alpha: float = 0.8, ax: matplotlib.axes = None)

Draw mullerplot

Args:

data:: Population size array. rows for cell type, columns for time point
label:: Cell type names
color:: Colors list
absolute:: False: show frequency; True: show cell number
alpha:: Transparency, in [0, 1]
ax:: axes to draw mullerplot

Return:

matplotlib.axes
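
The difference between the two `absolute` modes comes down to whether each time point is normalized by its total population. A minimal sketch of that conversion follows, using the row/column layout described for `data`; `sketch_to_frequency` is a hypothetical helper and not part of the plotting function.

```python
from typing import List


def sketch_to_frequency(data: List[List[float]]) -> List[List[float]]:
    """Rows are cell types, columns are time points; normalize each column."""
    n_time = len(data[0])
    totals = [sum(row[j] for row in data) for j in range(n_time)]
    return [[row[j] / totals[j] if totals[j] else 0.0 for j in range(n_time)]
            for row in data]


counts = [[10, 20, 30],   # cell type A over three time points
          [0, 20, 90]]    # cell type B over three time points
freqs = sketch_to_frequency(counts)
```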

label_name(loc, cell_types, ax, fontsize=12, font='DejaVu Sans')

Label cell type names on figures.

Args:

loc:: x, y locations in embedding
cell_types:: Cell type names
ax:: axes to label cell type name
fontsize:: fontsize
font:: font

Return:

matplotlib.axes

corr_plot(x, y, ax, stats='pearson', r0_x=None, r0_y=None, r1_x=None, r1_y=None, fontsize=10)

Draw a scatter plot of the two sets of data and show their correlation coefficients

Args:

x:: data1
y:: data2
ax:: axes to draw scatter on
stats:: pearson or spearman
r0_x, r0_y, r1_x, r1_y:: locations to label the correlation coefficient and the p-value
fontsize:: fontsize

Return:

matplotlib.axes

class scData(count: pandas DataFrame = None, x_normed: pandas DataFrame = None, latent_z: pandas DataFrame = None, Xdr: pandas DataFrame = None, phylo_tree: phylo.tree = None, cell_states: list = None, cell_names: list = None, cell_generation: list = None, megs: list = None, velocity: list = None, velocity_embeded: list = None, phylo_pseudotime: list = None, pvals: list = None, qvals: list = None)

Data structure for PhyloVelo analysis

Args:

count:: Read/UMI count. Index: cell names, columns: gene names
x_normed:: Normalized count. Index: cell names, columns: gene names
latent_z:: Inferred latent expression
Xdr:: PCA/UMAP or tSNE coordinates, n cells × 2
phylo_tree:: Phylogenetic tree
cell_states:: Cell types
cell_names:: Same as the index of count
cell_generation:: Generation time of cells
megs:: MEGs
velocity:: PhyloVelo velocity
velocity_embeded:: PhyloVelo velocity projected into the embedding
phylo_pseudotime:: Pseudotime inferred by PhyloVelo

count = None

x_normed = None

latent_z = None

Xdr = None

cell_names = None

phylo_tree = None

cell_states = None

cell_generation = None

megs = None

velocity = None

velocity_embeded = None

phylo_pseudotime = None

pvals = None

qvals = None

drop_duplicate_genes(target='count')

Remove duplicated genes

Args:

target:: count or x_normed

normalize_filter(is_normalize=True, is_log=True, min_count=10, target_sum=None)

Normalize read/UMI counts and filter genes

Args:

is_normalize:: True to normalize; similar to normalize_total in scanpy
is_log:: Apply log(1+X)
min_count:: Filter out genes whose total count is < min_count
target_sum:: If None, use the median

Return:

self.x_normed
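
The options above can be read as a three-stage pipeline: drop low-count genes, rescale every cell to a common total, then apply log(1+X). The sketch below follows that reading on a plain list-of-lists matrix. The order of the stages and the `sketch_normalize_filter` helper are assumptions; the method itself operates on the `scData` object.

```python
import math
from statistics import median
from typing import List, Optional


def sketch_normalize_filter(count: List[List[float]], is_normalize: bool = True,
                            is_log: bool = True, min_count: float = 10,
                            target_sum: Optional[float] = None):
    """count: rows are cells, columns are genes. Returns (matrix, kept gene indices)."""
    n_genes = len(count[0])
    # Keep genes whose total count across cells reaches min_count.
    kept = [g for g in range(n_genes) if sum(row[g] for row in count) >= min_count]
    mat = [[row[g] for g in kept] for row in count]
    if is_normalize:
        totals = [sum(row) for row in mat]
        target = median(totals) if target_sum is None else target_sum
        # Rescale every cell so its total count equals the target.
        mat = [[v * target / tot if tot else 0.0 for v in row]
               for row, tot in zip(mat, totals)]
    if is_log:
        mat = [[math.log1p(v) for v in row] for row in mat]
    return mat, kept


raw = [[5, 0, 15],
       [10, 1, 10],
       [20, 2, 20]]
x_normed, kept_genes = sketch_normalize_filter(raw, is_log=False)
```

In this toy input the middle gene has a total count of 3 and is dropped, and the third cell is scaled down to the median cell total.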

dimensionality_reduction(target: str = 'count', method: str = 'tsne', n_components: int = 2, scale: float = 1, pc: bool = True, **kwags)

PCA/tSNE or UMAP

Args:

target:: count or x_normed
method:: pca, tsne or umap
n_components:: Reduce the dimension to ‘n_components’
scale:: Normalization scale
pc:: Whether to run PCA before tSNE/UMAP
pc_components:: Number of PCs to use for tSNE/UMAP
perplexity:: tSNE perplexity
n_neighbors:: UMAP n_neighbors
min_dist:: UMAP min_dist

Return:

self.Xdr

velocity_inference(sd: scData, time: list = None, cutoff: float = 0.97, alpha: float = 0.05, target: str = 'x_normed', exact: bool = False)

Infer phylogenetic velocity

Args:

sd:: scData
time:: if None, cell generation will be automatically calculated from phylo tree
cutoff:: Only calculate genes with top ‘cutoff’ correlation
alpha:: Significance level
target:: Data to run inference on: ‘count’ for the NB model or ‘x_normed’ for the normal model
exact:: True to use the ‘is_meg’ function; False to skip it

Return:

sd.velocity

velocity_embedding(sd: scData, target: str = 'count', n_neigh: int = None, chunk_size: int = None)

Project velocity into embedding

Args:

sd:: scData
target:: count or x_normed
n_neigh:: kNN pooling. Default: Ncells//3
chunk_size:: Number of cells per vectorized block. Default estimates a memory-safe size.

calc_meg_pseudotime(sd: scData, target: str = 'x_normed', genes: list = None, robust_quantiles: tuple = (0.05, 0.95), aggregation: str = 'median', min_genes: int = 3, query_data=None, query_sd: scData = None, query_target: str = None)

Calculate pseudotime directly from robustly oriented MEG expression.

Args:

sd:: scData
target:: Expression matrix to use, usually ‘x_normed’ or ‘count’.
genes:: MEGs to use. Default uses sd.megs.
robust_quantiles:: Lower and upper quantiles used to clip per-gene expression.
aggregation:: ‘median’ for robust L1 aggregation or ‘weighted_mean’.
min_genes:: Minimum number of usable MEGs.
query_data:: Independent expression matrix (cells x genes) to score with the reference dataset’s MEGs, velocity directions, and robust scaling.
query_sd:: Independent scData object. Uses query_target or target as expression matrix and writes query pseudotime to query_sd.phylo_pseudotime.
query_target:: Expression matrix name for query_sd. Default: same as target.

Return:

sd if no query is provided; query pseudotime array if query_data is provided; query_sd if query_sd is provided.
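
The arguments suggest a per-gene recipe: clip each MEG to robust quantiles, rescale it, orient it by the sign of its velocity, and aggregate across genes. The sketch below implements that recipe with median aggregation on a plain list-of-lists matrix. The [0, 1] rescaling, the quantile interpolation, and the `sketch_meg_pseudotime` helper are assumptions; the function's actual computation may differ.

```python
from statistics import median
from typing import List, Sequence


def _quantile(values: Sequence[float], q: float) -> float:
    # Linear-interpolation quantile on sorted values.
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def sketch_meg_pseudotime(expr: List[List[float]], velocity: Sequence[float],
                          robust_quantiles=(0.05, 0.95), min_genes: int = 3) -> List[float]:
    """expr: rows are cells, columns are MEGs; velocity: one value per MEG."""
    scores = []  # one list of per-cell scores for each usable gene
    for g in range(len(velocity)):
        col = [row[g] for row in expr]
        lo, hi = _quantile(col, robust_quantiles[0]), _quantile(col, robust_quantiles[1])
        if hi <= lo or velocity[g] == 0:
            continue  # gene carries no usable signal
        # Clip to the robust range and rescale to [0, 1].
        scaled = [(min(max(v, lo), hi) - lo) / (hi - lo) for v in col]
        # Orient so that larger values always mean "later".
        scores.append(scaled if velocity[g] > 0 else [1 - s for s in scaled])
    if len(scores) < min_genes:
        raise ValueError("too few usable MEGs")
    return [median(gene[c] for gene in scores) for c in range(len(expr))]


# Four cells ordered in time; genes 0 and 1 increase, gene 2 decreases.
expr = [[0, 1, 9], [1, 2, 6], [2, 3, 3], [3, 4, 0]]
pseudotime = sketch_meg_pseudotime(expr, velocity=[1.0, 0.5, -2.0])
```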

calc_phylo_pseudotime(sd: scData, n_neighbors: int = 30, r_sample: float = 1, method: str = 'graph', target: str = 'x_normed', random_state: int = None)

Calculate the phyloVelo pseudotime

Args:

sd:: scData
n_neighbors:: Number of nearest neighbors used to build the MST. Smaller values are faster but may introduce errors
r_sample:: Fraction in [0, 1] of cells randomly sampled to calculate pseudotime
method:: ‘graph’ uses embedding velocities and a kNN MST; ‘meg’ uses robust MEG expression.
target:: Expression matrix used when method=’meg’.
random_state:: Seed for subsampling.

Return:

scData.phylo_pseudotime
