Metadata-Version: 2.1
Name: pyVIA
Version: 0.1.8
Summary: UNKNOWN
Home-page: https://github.com/ShobiStassen/VIA
Author-email: shobana.venkat88@gmail.com
License: MIT
Description: # Via
        VIA is a single-cell Trajectory Inference method that offers topology construction, pseudotimes, automated terminal state prediction and automated plotting of temporal gene dynamics along lineages. VIA combines lazy-teleporting random walks and Markov Chain Monte Carlo (MCMC) simulations to overcome common challenges such as 1) accurate terminal state and lineage inference, 2) the ability to capture combinations of cyclic, disconnected and tree-like structures, and 3) scalability in feature and sample space. It is also well-suited for multi-omic analysis. In addition to transcriptomic data, VIA works on scATAC-seq, flow and imaging cytometry data.
        
        ## Getting Started
        ### Install using pip
        We recommend setting up a new conda environment:
        ```
        conda create --name ViaEnv pip
        conda activate ViaEnv
        pip install pyVIA  # tested on Linux
        ```
        ### Install by cloning the repository and running setup.py (ensure dependencies are installed)
        ```
        git clone https://github.com/ShobiStassen/VIA.git
        cd VIA  # the cloned folder containing setup.py
        python3 setup.py install
        ```
        
        ### Install dependencies separately if needed (Linux)
        If the pip install fails, it usually suffices to first install all the requirements (using pip) and then install VIA (also using pip):
        ```
        pip install python-igraph "leidenalg>=0.7.0" hnswlib umap-learn "numpy>=1.17" scipy "pandas>=0.25" scikit-learn termcolor pygam phate
        pip install pyVIA
        ```
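        
        To see which requirements are still missing before retrying the install, a small check along the following lines may help. This is only a sketch: the helper `missing_modules` is hypothetical (not part of pyVIA), and the list uses the import names of the packages above (for example, python-igraph is imported as `igraph` and umap-learn as `umap`).
        
        ```python
        from importlib.util import find_spec
        
        # Import names (not pip names) of the dependencies listed above.
        REQUIRED_MODULES = [
            "igraph", "leidenalg", "hnswlib", "umap", "numpy", "scipy",
            "pandas", "sklearn", "termcolor", "pygam", "phate",
        ]
        
        def missing_modules(names):
            """Return the top-level module names that cannot be found in this environment."""
            return [name for name in names if find_spec(name) is None]
        
        if __name__ == "__main__":
            missing = missing_modules(REQUIRED_MODULES)
            print("missing:", ", ".join(missing) if missing else "none")
        ```
        
        Any name reported as missing can then be installed individually with pip before installing pyVIA again.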
        ## Examples
        ### 1.a Human Embryoid Bodies (wrapper function)
        ### 1.b Human Embryoid Bodies (Configuring VIA)
        ### 2.a Toy Data (multifurcation)
        ### 2.b Toy Data (disconnected)
        ### 3.a General input format and wrapper function
        ### 3.b General disconnected trajectories wrapper function
        ------------------------------------------------------
        ### 1.a Human Embryoid Bodies (wrapper function)
        Save the [Raw data](https://drive.google.com/file/d/1yz3zR1KAmghjYB_nLLUZoIlKN9Ew4RHf/view?usp=sharing) matrix as 'EBdata.mat'. The cells in this file have already been filtered to remove libraries that are too small or too large by [Moon et al. 2019](https://nbviewer.jupyter.org/github/KrishnaswamyLab/PHATE/blob/master/Python/tutorial/EmbryoidBody.ipynb).
        
        The function main_EB_clean() preprocesses the cells (library-size normalization followed by a square-root transformation). It then calls VIA to plot the pseudotimes, terminal states, lineage pathways and gene-clustermap. The visualization method used in this function is PHATE.
        ```
        import pyVIA.core as via
        via.main_EB_clean(ncomps=30, knn=20, p0_random_seed=20, foldername = '') # Most reasonable parameters of ncomps (10-200) and knn (15-50) work well
        ```
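        
        For readers who want to preprocess their own count matrix in a similar way, the two steps named above (library-size normalization and square-root transformation) can be sketched as follows. This is an illustration only: the helper `normalize_and_sqrt` is hypothetical, and scaling every cell to the median library size is an assumption, since the exact target used inside main_EB_clean() is not stated here.
        
        ```python
        import numpy as np
        
        def normalize_and_sqrt(counts):
            """Scale each cell (row) to the median library size, then take the square root."""
            counts = np.asarray(counts, dtype=float)
            library_size = counts.sum(axis=1, keepdims=True)
            target = np.median(library_size)
            # Guard against empty cells so the division is always defined.
            safe_size = np.where(library_size == 0, 1.0, library_size)
            return np.sqrt(counts / safe_size * target)
        
        # Toy matrix: 3 cells x 3 genes; the second cell is sequenced 10x deeper than the first.
        toy_counts = np.array([[1, 3, 0],
                               [10, 30, 0],
                               [2, 0, 2]])
        normalized = normalize_and_sqrt(toy_counts)
        ```
        
        After normalization the first two cells have identical profiles, because they differ only in sequencing depth.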
        ### 1.b Human Embryoid Bodies (Configuring VIA)
        If you wish to run the data using UMAP or TSNE (instead of PHATE), or require more control of the parameters/outputs, then use the following code:
        ```
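        import matplotlib.pyplot as plt
        import numpy as np
        import pyVIA.core as via
        
        # Pre-process the data as needed and provide it to VIA as a numpy array (input_data).
        # time_labels holds the known time point of each cell; Y_phate is a 2D embedding of the cells.
        # root_user is the index of a cell that is a suitable start/root cell.
        # Example values taken from the other examples in this README; adjust as needed.
        knn, ncomps = 20, 30
        v0_too_big, v1_too_big = 0.3, 0.1
        v0_random_seed = 20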
        
        v0 = via.VIA(input_data, time_labels, jac_std_global=0.15, dist_std_local=1, knn=knn,
                     too_big_factor=v0_too_big, root_user=1, dataset='EB', random_seed=v0_random_seed,
                     do_magic_bool=True, is_coarse=True, preserve_disconnected=True)
        v0.run_VIA()
        
        # Translate the terminal clusters found in v0 to the fine-grained run in v1
        tsi_list = via.get_loc_terminal_states(v0, input_data)
        
        v1 = via.VIA(input_data, time_labels, jac_std_global=0.15, dist_std_local=1, knn=knn,
                     too_big_factor=v1_too_big, super_cluster_labels=v0.labels, super_node_degree_list=v0.node_degree_list,
                     super_terminal_cells=tsi_list, root_user=1, is_coarse=False, full_neighbor_array=v0.full_neighbor_array,
                     full_distance_array=v0.full_distance_array, ig_full_graph=v0.ig_full_graph,
                     csr_array_locally_pruned=v0.csr_array_locally_pruned,
                     x_lazy=0.95, alpha_teleport=0.99, preserve_disconnected=True, dataset='EB',
                     super_terminal_clusters=v0.terminal_clusters, random_seed=21)
        v1.run_VIA()
        
        # Plot the true time labels and the inferred pseudotimes
        # Replace Y_phate with a UMAP or t-SNE embedding if preferred
        f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
        ax1.scatter(Y_phate[:, 0], Y_phate[:, 1], c=time_labels, s=5, cmap='viridis', alpha=0.5)
        ax2.scatter(Y_phate[:, 0], Y_phate[:, 1], c=v1.single_cell_pt_markov, s=5, cmap='viridis', alpha=0.5)
        ax1.set_title('Embryoid Data: Days')
        ax2.set_title('Embryoid Data: VIA')
        plt.show()
        
        # Obtain the single-cell locations of the terminal clusters, used to visualize trajectories/lineages
        super_clus_ds_PCA_loc = via.sc_loc_ofsuperCluster_PCAspace(v0, v1, np.arange(0, len(v1.labels)))
        # Draw the overall lineage paths on the embedding
        via.draw_trajectory_gams(Y_phate, super_clus_ds_PCA_loc, v1.labels, v0.labels, v0.edgelist_maxout,
                                 v1.x_lazy, v1.alpha_teleport, v1.single_cell_pt_markov, time_labels, knn=v0.knn,
                                 final_super_terminal=v1.revised_super_terminal_clusters,
                                 sub_terminal_clusters=v1.terminal_clusters,
                                 title_str='Pseudotime and path', ncomp=ncomps)
        
        # A Python identifier cannot start with a digit, so the KNN graph of the 2D embedding is named knn_hnsw_2d
        knn_hnsw_2d = via.make_knn_embeddedspace(Y_phate)  # used to visualize the path obtained in the high-dimensional KNN graph
        # Draw the individual lineage paths and cell-fate probabilities at single-cell level
        via.draw_sc_evolution_trajectory_dijkstra(v1, Y_phate, knn_hnsw_2d, v0.full_graph_shortpath,
                                              idx=np.arange(0, input_data.shape[0]))
        plt.show()
        ```
        ![Output of VIA on Human Embryoid](https://github.com/ShobiStassen/VIA/blob/master/Figures/EB_fig1.png?raw=true)
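        
        The `x_lazy` and `alpha_teleport` arguments in example 1.b relate to the lazy-teleporting random walk mentioned in the introduction. As a rough intuition for what such a walk looks like, the sketch below builds the transition matrix of a generic lazy-teleporting random walk on a small graph. It is not VIA's internal implementation: the function name is hypothetical, and it assumes that the walker stays in place with probability `1 - x_lazy` and jumps to a uniformly chosen node with probability `1 - alpha_teleport`.
        
        ```python
        import numpy as np
        
        def lazy_teleporting_transition(adjacency, x_lazy=0.95, alpha_teleport=0.99):
            """Row-stochastic transition matrix of a generic lazy-teleporting random walk.
        
            Assumption: with probability (1 - x_lazy) the walker stays in place, and with
            probability (1 - alpha_teleport) it jumps to a uniformly chosen node.
            """
            adjacency = np.asarray(adjacency, dtype=float)
            n = adjacency.shape[0]
            degree = adjacency.sum(axis=1, keepdims=True)
            # Plain random walk: normalize each row; isolated nodes stay in place.
            walk = np.where(degree > 0, adjacency / np.where(degree > 0, degree, 1.0), np.eye(n))
            lazy = x_lazy * walk + (1.0 - x_lazy) * np.eye(n)
            return alpha_teleport * lazy + (1.0 - alpha_teleport) * np.full((n, n), 1.0 / n)
        
        # A 3-node path graph: 0 - 1 - 2
        path_graph = np.array([[0, 1, 0],
                               [1, 0, 1],
                               [0, 1, 0]])
        P = lazy_teleporting_transition(path_graph)
        ```
        
        Every row of `P` sums to 1 and every entry is strictly positive, so the walk can reach any node from any other node even when the underlying graph is disconnected.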
        
        
        ### 2.a/b Toy data (Multifurcation and Disconnected)
        Two example [toy datasets](https://drive.google.com/drive/folders/1WQSZeNixUAB1Sm0Xf68ZnSLQXyep936l?usp=sharing) with annotations, generated using DynToy, are provided.
        ```
        import pyVIA.core as via
        # Multifurcation: the root is automatically set to root_user = 'M1'
        via.main_Toy(ncomps=10, knn=30, dataset='Toy3', random_seed=2, foldername=".../Trajectory/Datasets/")
        # Disconnected trajectories: the root is automatically set to the list root_user = ['T1_M1', 'T2_M1']
        # (e.g. T2_M3 is a cell belonging to the 3rd Milestone (M3) of the second Trajectory (T2))
        via.main_Toy(ncomps=10, knn=30, dataset='Toy4', random_seed=2, foldername=".../Trajectory/Datasets/")  # 2 disconnected trajectories
        ```
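        
        The milestone labels used as roots above follow a simple naming pattern ('M1' for a single trajectory, 'T2_M3' for milestone 3 of trajectory 2). If you need to group or sort cells by these labels, a small parser such as the one below may be convenient. The helper `parse_milestone_label` is hypothetical and not part of pyVIA; it only assumes the label pattern described in the comments above.
        
        ```python
        import re
        
        # Optional trajectory prefix "T<number>_" followed by "M<number>".
        _LABEL_PATTERN = re.compile(r"^(?:T(?P<trajectory>\d+)_)?M(?P<milestone>\d+)$")
        
        def parse_milestone_label(label):
            """Split a label such as 'T2_M3' into (trajectory, milestone).
        
            Labels without a trajectory prefix (e.g. 'M1') return None for the trajectory.
            """
            match = _LABEL_PATTERN.match(label)
            if match is None:
                raise ValueError(f"unrecognized milestone label: {label!r}")
            trajectory = match.group("trajectory")
            return (int(trajectory) if trajectory is not None else None,
                    int(match.group("milestone")))
        ```
        
        For example, `parse_milestone_label('T2_M3')` yields trajectory 2 and milestone 3, while `parse_milestone_label('M1')` yields no trajectory and milestone 1.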
        #### Output of multifurcating toy dataset
        ![Output of VIA multifurcating toy dataset](https://github.com/ShobiStassen/VIA/blob/master/Figures/Toy3_fig0.png?raw=true)
        #### Output of disconnected toy dataset
        ![Output of VIA on disconnected toy dataset](https://github.com/ShobiStassen/VIA/blob/master/Figures/Toy4_fig0.png?raw=true)
        
        ### 3.a General input format and wrapper function (uses example of pre-B cell differentiation) 
        Datasets and labels used in this example are provided in [Datasets](https://github.com/ShobiStassen/VIA/tree/master/Datasets).
        
        ```
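        import pandas as pd
        import scanpy as sc
        import umap
        import pyVIA.core as via
        
        # Read the two files:
        # 1) The first file contains 200 PCs of the filtered and normalized B-cell data for the top 5000 highly variable genes (HVGs).
        # 2) The second file contains raw count data for marker genes.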
        
        data = pd.read_csv('./Bcell_200PCs.csv')
        data_genes = pd.read_csv('./Bcell_markergenes.csv')
        data_genes = data_genes.drop(['cell'], axis=1)
        true_label = data['time_hour']
        data = data.drop(['cell', 'time_hour'], axis=1)
        adata = sc.AnnData(data_genes)
        adata.obsm['X_pca'] = data.values
        
        # Use UMAP or PHATE to obtain the embedding used for single-cell level visualization
        embedding = umap.UMAP(random_state=42, n_neighbors=15, init='random').fit_transform(data.values[:, 0:5])
        
        # list marker genes or genes of interest if known in advance. otherwise marker_genes = []
        marker_genes = ['Igll1', 'Myc', 'Slc7a5', 'Ldha', 'Foxo1', 'Lig4', 'Sp7']  # irf4 down-up
        # call VIA. We identify an early (suitable) start cell root = [42]. Can also set an arbitrary value
        via.via_wrapper(adata, true_label, embedding, knn=20, ncomps=20, jac_std_global=0.15, root=[42], dataset='',
                    random_seed=1,v0_toobig=0.3, v1_toobig=0.1, marker_genes=marker_genes)
        ```
        ### 3.b VIA wrapper for generic disconnected trajectory
        ```
        import pandas as pd
        import scanpy as sc
        import pyVIA.core as via
        
        # foldername is the location where you saved the Toy Disconnected data (see example 2)
        foldername = ".../Trajectory/Datasets/"
        ncomps = 10
        # Read in the data and labels
        df_counts = pd.read_csv(foldername + "toy_disconnected_M9_n1000d1000.csv", delimiter=",")
        df_ids = pd.read_csv(foldername + "toy_disconnected_M9_n1000d1000_ids.csv", delimiter=",")
        
        # Make AnnData object for wrapper function to read-in data and do PCA
        df_ids['cell_id_num'] = [int(s[1:]) for s in df_ids['cell_id']]
        df_counts = df_counts.drop('Unnamed: 0', axis=1)
        df_ids = df_ids.sort_values(by=['cell_id_num'])
        df_ids = df_ids.reset_index(drop=True)
        true_label = df_ids['group_id']
        adata_counts = sc.AnnData(df_counts, obs=df_ids)
        sc.tl.pca(adata_counts, svd_solver='arpack', n_comps=ncomps)
        
        # Since there are 2 disconnected trajectories, we provide 2 arbitrary roots (start cells).
        # If there are more disconnected paths, VIA arbitrarily selects the remaining roots.
        # The root can also simply be set to [1]; VIA detects how many additional roots it must add.
        via.via_wrapper_disconnected(adata_counts, true_label, embedding=adata_counts.obsm['X_pca'][:, 0:2], root=[1, 1], preserve_disconnected=True, knn=30, ncomps=10, cluster_graph_pruning_std=1)
        
        # For connected data (i.e. only 1 graph component, e.g. the multifurcating toy data), the wrapper function from example 3.a can be used instead:
        # via.via_wrapper(adata_counts, true_label, embedding=adata_counts.obsm['X_pca'][:, 0:2], root=[1], knn=30, ncomps=10, cluster_graph_pruning_std=1)
        ```
Platform: UNKNOWN
Description-Content-Type: text/markdown
