API Tutorial: Full pyInfinityFlow Pipeline


This tutorial uses the pyInfinityFlow API to carry out the full analysis pipeline on an example dataset. The example is a subset of the previously published mouse lung dataset[1]; the full dataset is publicly available on flowrepository.org. The subset ships with the pyInfinityFlow repository on GitHub and consists of 10 InfinityMarkers and 5 Isotype controls located in the ‘example_data’ directory. This directory also contains the InfinityMarker annotation file and the Backbone annotation file, both of which are required by the analysis pipeline.

Once Git is installed, you can download the repository by changing to the directory where you want to place it and running the following command:

git clone https://github.com/KyleFerchen/pyInfinityFlow.git

Provide paths for your machine

After you have cloned the repository, set its path below so that you can run this notebook on your machine:

[ ]:
# Specify the path to the repository
path_to_repo = "/media/kyle_ssd1/Repositories/pyInfinityFlow/"
# Specify which directory on your machine you want to save the results
my_output_dir = "/media/kyle_ssd1/outputs/"
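
Before going further, it can help to confirm that both paths point where you expect. The helper below is a small sketch and not part of pyInfinityFlow: the name check_tutorial_paths is hypothetical, and the "example_data" subdirectory name is taken from the paths used later in this tutorial.

```python
import os

def check_tutorial_paths(path_to_repo, my_output_dir):
    """Return a list of human-readable problems; an empty list means no problems."""
    problems = []
    example_dir = os.path.join(path_to_repo, "example_data")
    if not os.path.isdir(path_to_repo):
        problems.append(f"Repository not found: {path_to_repo}")
    elif not os.path.isdir(example_dir):
        problems.append(f"Missing example data directory: {example_dir}")
    # Create the output directory if it does not exist yet
    os.makedirs(my_output_dir, exist_ok=True)
    return problems
```

Calling check_tutorial_paths(path_to_repo, my_output_dir) right after the cell above returns an empty list when the repository was cloned to the given location.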

Step 1: Preparing the Inputs

Backbone Annotation File

First, we need to locate the Backbone annotation file. It tells the program which channel names in the input FCS files to use as the Backbone (the predictors in the regression model). It is simply a .csv or .tsv file with three columns (in the order shown below) that annotate:

  1. The channel names in the reference FCS file(s) (the data we use to build the final InfinityFlow object)

  2. The channel names in the InfinityMarker FCS files (the data used to fit and validate the models)

  3. The final name to use for the channel in the InfinityFlow object

This file should have the column names as the first line.
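
If you need to build such a file for your own panel, the sketch below writes one with Python's standard csv module. The helper name write_backbone_annotation is hypothetical. The header names are copied from the example file shown further down; whether pyInfinityFlow requires these exact header names (rather than just the column order) is not stated here, so treat them as an assumption.

```python
import csv

# Header names as they appear in the example annotation file
BACKBONE_COLUMNS = ["Reference_Backbone", "Query_Backbone", "Final_Name"]

def write_backbone_annotation(path, rows):
    """Write (reference_channel, query_channel, final_name) rows to a .csv file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(BACKBONE_COLUMNS)  # column names go on the first line
        for row in rows:
            if len(row) != 3:
                raise ValueError(f"Expected 3 columns, got {len(row)}: {row}")
            writer.writerow(row)

# Two rows taken from the example dataset's Backbone annotation
example_rows = [
    ("FJComp-BUV395-A", "FJComp-BUV395-A", "CD4"),
    ("FJComp-BV421-A", "FJComp-BV421-A", "CD8"),
]
```

For example, write_backbone_annotation("my_backbone_anno.csv", example_rows) produces a three-line file (the output file name is only a placeholder).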

After downloading the pyInfinityFlow repository from GitHub, we can access an example file for this test dataset, e.g.:

'/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset_backbone_anno.csv'

The pyInfinityFlow.InfinityFlow_Utilities module provides a simple function (read_annotation_table) to read either a .csv, .tsv, or .txt (tab-delimited) file into a pandas.DataFrame object:

[1]:
import os
from pyInfinityFlow import InfinityFlow_Utilities

# PROVIDE THE PATH TO WHERE YOU DOWNLOADED THE REPOSITORY
path_backbone = os.path.join(path_to_repo, "example_data/mouse_lung_dataset_subset_backbone_anno.csv")
backbone_anno = InfinityFlow_Utilities.read_annotation_table(path_backbone)
backbone_anno
[1]:
    Reference_Backbone      Query_Backbone          Final_Name
0   FJComp-APC-A            FJComp-APC-A            CD69-CD301b
1   FJComp-AlexaFluor700-A  FJComp-AlexaFluor700-A  MHCII
2   FJComp-BUV395-A         FJComp-BUV395-A         CD4
3   FJComp-BUV737-A         FJComp-BUV737-A         CD44
4   FJComp-BV421-A          FJComp-BV421-A          CD8
5   FJComp-BV510-A          FJComp-BV510-A          CD11c
6   FJComp-BV605-A          FJComp-BV605-A          CD11b
7   FJComp-BV650-A          FJComp-BV650-A          F480
8   FJComp-BV711-A          FJComp-BV711-A          Ly6C
9   FJComp-BV786-A          FJComp-BV786-A          Lineage
10  FJComp-GFP-A            FJComp-GFP-A            CD45a488
11  FJComp-PE-Cy7(yg)-A     FJComp-PE-Cy7(yg)-A     CD24
12  FJComp-PerCP-Cy5-5-A    FJComp-PerCP-Cy5-5-A    CD103

InfinityMarker Annotation File

The InfinityMarker annotation file specifies which FCS files to use to build the regression models and how they should be treated. Each InfinityMarker (the flow cytometry signal to impute using the Backbone) has a row in this annotation file with the following columns:

  1. The FCS file name

  2. The InfinityMarker channel name (exactly as it appears in the FCS file)

  3. The name to give the channel in the final InfinityFlow object

  4. (OPTIONAL) The final name of the matching Isotype control InfinityMarker (this should be an entry in the third column, belonging to one of the InfinityMarkers that are Isotype controls)
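
The relationship between the third and fourth columns can be checked before running the pipeline. The sketch below is not a pyInfinityFlow function (the name find_unmatched_isotypes is hypothetical). It reports every Isotype name in column 4 that has no matching entry in column 3. The file names in the example rows are placeholders.

```python
def find_unmatched_isotypes(rows):
    """rows: iterable of (file, channel, name, isotype) tuples.

    Return the sorted Isotype names (column 4) that never appear
    as a final name (column 3).
    """
    rows = list(rows)
    names = {row[2] for row in rows}
    isotypes = {row[3] for row in rows if row[3]}  # skip empty entries
    return sorted(isotypes - names)

# Placeholder file names; channel and marker names follow the example dataset
example_rows = [
    ("marker_a.fcs", "FJComp-PE(yg)-A", "CD105", "Isotype_rIgG2a"),
    ("marker_b.fcs", "FJComp-PE(yg)-A", "CD106", "Isotype_rIgG2a"),
    ("isotype_a.fcs", "FJComp-PE(yg)-A", "Isotype_rIgG2a", "Isotype_rIgG2a"),
]
unmatched = find_unmatched_isotypes(example_rows)
```

An empty result means every marker points at an Isotype control that is itself listed as an InfinityMarker.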

This file is included in the same directory as the Backbone annotation file in the GitHub repository, e.g.:

'/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset_infinity_marker_anno.csv'

Isotype background correction is an optional step in which a linear model is used to regress out the background binding and fluorescence of an antibody raised with a specific immunoglobulin. You can read more about it in the original publication. The InfinityMarker annotation file determines whether background correction is performed: the pipeline only attempts it if this annotation file has a fourth column.
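
To make the idea of regressing out background concrete, here is a minimal sketch in plain Python. It is not pyInfinityFlow's implementation: the function name regress_out_background is hypothetical, and the single-predictor least-squares fit is an assumption made purely for illustration. It fits the marker signal as a linear function of the matching Isotype control signal and keeps the residuals.

```python
def regress_out_background(marker, isotype):
    """Fit marker ~ intercept + slope * isotype by ordinary least squares
    and return the residuals (marker minus the fitted background)."""
    n = len(marker)
    if n != len(isotype) or n < 2:
        raise ValueError("Need two equally sized inputs with at least 2 events")
    mean_x = sum(isotype) / n
    mean_y = sum(marker) / n
    sxx = sum((x - mean_x) ** 2 for x in isotype)
    if sxx == 0:
        raise ValueError("Isotype signal is constant; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(isotype, marker))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(isotype, marker)]

# Toy values: the last event carries signal well above the background trend
isotype_signal = [0.10, 0.20, 0.30, 0.40]
marker_signal = [0.25, 0.45, 0.65, 1.50]
corrected = regress_out_background(marker_signal, isotype_signal)
```

Events whose marker signal is fully explained by the Isotype signal end up near zero, while events with additional specific signal keep a positive residual.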

The InfinityMarker annotation file, like the Backbone annotation file, is expected to be either a .csv, .tsv, or .txt (tab-delimited) file, and can also be read into a pandas.DataFrame using the read_annotation_table function:

[2]:
path_infmarker = os.path.join(path_to_repo,
    "example_data/mouse_lung_dataset_subset_infinity_marker_anno.csv")
infinitymarker_anno = InfinityFlow_Utilities.read_annotation_table(path_infmarker)
infinitymarker_anno
[2]:
    File                                               Channel          Name             Isotype
0   backbone_Plate2_Specimen_001_G1_G01_073_target...  FJComp-PE(yg)-A  33D1             Isotype_rIgG2b
1   backbone_Plate2_Specimen_001_F7_F07_067_target...  FJComp-PE(yg)-A  Allergin-1       Isotype_mIgG1
2   backbone_Plate2_Specimen_001_F8_F08_068_target...  FJComp-PE(yg)-A  B7-H4            Isotype_AHIgG
3   backbone_Plate1_Specimen_001_A2_A02_002_target...  FJComp-PE(yg)-A  CD1d             Isotype_rIgG2b
4   backbone_Plate1_Specimen_001_G4_G04_076_target...  FJComp-PE(yg)-A  CD103            Isotype_AHIgG
5   backbone_Plate1_Specimen_001_G5_G05_077_target...  FJComp-PE(yg)-A  CD105            Isotype_rIgG2a
6   backbone_Plate1_Specimen_001_G6_G06_078_target...  FJComp-PE(yg)-A  CD106            Isotype_rIgG2a
7   backbone_Plate1_Specimen_001_G7_G07_079_target...  FJComp-PE(yg)-A  CD107a (Lamp-1)  Isotype_rIgG2a
8   backbone_Plate1_Specimen_001_G8_G08_080_target...  FJComp-PE(yg)-A  CD107b (Mac-3)   Isotype_rIgG1
9   backbone_Plate1_Specimen_001_G9_G09_081_target...  FJComp-PE(yg)-A  CD115            Isotype_rIgG2a
10  backbone_Plate3_Specimen_001_F12_F12_072_targe...  FJComp-PE(yg)-A  Isotype_rIgG2b   Isotype_rIgG2b
11  backbone_Plate3_Specimen_001_F6_F06_066_target...  FJComp-PE(yg)-A  Isotype_mIgG1    Isotype_mIgG1
12  backbone_Plate3_Specimen_001_F4_F04_064_target...  FJComp-PE(yg)-A  Isotype_AHIgG    Isotype_AHIgG
13  backbone_Plate3_Specimen_001_F11_F11_071_targe...  FJComp-PE(yg)-A  Isotype_rIgG2a   Isotype_rIgG2a
14  backbone_Plate3_Specimen_001_F10_F10_070_targe...  FJComp-PE(yg)-A  Isotype_rIgG1    Isotype_rIgG1

Step 2: Checking the Inputs and Building an InfinityFlowFileHandler

Next, we need to specify the directory in which the FCS files are saved. In the pyInfinityFlow GitHub repository, this directory sits in the same parent directory as the annotation files:

'/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset'

Then we can use the check_infinity_flow_annotation_dataframes function to do the following:

  • Validate the input annotation DataFrames

  • Scan through the InfinityMarker FCS files to split events into training/validation/pooling subsets

  • Return an InfinityFlowFileHandler to store how each of the InfinityMarker files will be processed

Here we will use the n_events_combine parameter to pool events from each of the individual InfinityMarker files for the final InfinityFlow object. Each of the original channels from these files will be preserved in the final InfinityFlow object.

Note: it is also possible to use the separate_backbone_reference argument to supply a separate FCS file on which the predictions will be made. This is useful if one or more features are not well explained by the Backbone channels and therefore should not be imputed.
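
The splitting of events into training, validation, and pooling subsets can be pictured with the sketch below. It is an illustration only, not the library's code: the function name split_events, the seed argument, and the choice to draw the pooled events from the held-out subset are all assumptions. Only the parameter names ratio_for_validation and n_events_combine are taken from the call used in this tutorial.

```python
import random

def split_events(n_events, ratio_for_validation=0.5, n_events_combine=1000, seed=0):
    """Split event indices 0..n_events-1 into train, test, and pool index lists."""
    rng = random.Random(seed)
    indices = list(range(n_events))
    rng.shuffle(indices)
    n_validate = int(n_events * ratio_for_validation)
    test = sorted(indices[:n_validate])    # held out for validation
    train = sorted(indices[n_validate:])   # used to fit the model
    # Assumption: pooled events are sampled from the held-out events
    n_pool = min(n_events_combine, len(test))
    pool = sorted(rng.sample(test, n_pool))
    return train, test, pool

train_idx, test_idx, pool_idx = split_events(10000)
```

With ratio_for_validation=0.5, half of the events in each file are used to fit the model and the other half are reserved for scoring it.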

[3]:
fcs_dir = os.path.join(path_to_repo,
    "example_data/mouse_lung_dataset_subset")

file_handler = InfinityFlow_Utilities.check_infinity_flow_annotation_dataframes(
    backbone_annotation=backbone_anno,
    infinity_marker_annotation=infinitymarker_anno,
    n_events_train=0, # Use all possible events in the FCS file
    n_events_validate=0, # Use all possible events in the FCS file
    ratio_for_validation=0.5,
    n_events_combine=1000, # Events to pool into a final InfinityFlow object
    input_fcs_dir=fcs_dir,
    verbosity=1)

file_handler
Isotype controls detected. Will attempt to use background correction...
[3]:
InfinityFlowFileHandler Object from pyInfinityFlow
        .handles the following InfinityMarkers:
                        33D1
                        Allergin-1
                        B7-H4
                        CD1d
                        CD103
                        CD105
                        CD106
                        CD107a (Lamp-1)
                        CD107b (Mac-3)
                        CD115
                        Isotype_rIgG2b
                        Isotype_mIgG1
                        Isotype_AHIgG
                        Isotype_rIgG2a
                        Isotype_rIgG1

        Held in the InfinityFlowFileHandler.handles dictionary

        InfinityFlowFileHandler.list_infinity_markers holds ordered list of InfinityMarkers

For example, you can see how the InfinityMarker “33D1” is stored in the file_handler.handles dictionary, including the name, file_name, directory, reference_backbone_channels, backbone_channels, prediction_channel, train_indices, test_indices, and pool_indices.

This information will be used later on to carry out XGBoost regression.

[4]:
file_handler.handles["33D1"]
[4]:
{'name': '33D1',
 'file_name': 'backbone_Plate2_Specimen_001_G1_G01_073_target_33D1.fcs',
 'directory': '/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset',
 'reference_backbone_channels': array(['FJComp-APC-A', 'FJComp-AlexaFluor700-A', 'FJComp-BUV395-A',
        'FJComp-BUV737-A', 'FJComp-BV421-A', 'FJComp-BV510-A',
        'FJComp-BV605-A', 'FJComp-BV650-A', 'FJComp-BV711-A',
        'FJComp-BV786-A', 'FJComp-GFP-A', 'FJComp-PE-Cy7(yg)-A',
        'FJComp-PerCP-Cy5-5-A'], dtype=object),
 'backbone_channels': array(['FJComp-APC-A', 'FJComp-AlexaFluor700-A', 'FJComp-BUV395-A',
        'FJComp-BUV737-A', 'FJComp-BV421-A', 'FJComp-BV510-A',
        'FJComp-BV605-A', 'FJComp-BV650-A', 'FJComp-BV711-A',
        'FJComp-BV786-A', 'FJComp-GFP-A', 'FJComp-PE-Cy7(yg)-A',
        'FJComp-PerCP-Cy5-5-A'], dtype=object),
 'prediction_channel': 'FJComp-PE(yg)-A',
 'train_indices': array([     0,      1,      2, ..., 106341, 106343, 106345]),
 'test_indices': array([     5,      7,      9, ..., 106342, 106344, 106346]),
 'pool_indices': array([     5,    113,    137,    375,    430,    474,    527,    709,
           914,    930,   1006,   1026,   1184,   1229,   1230,   1315,
          1449,   1867,   2042,   2122,   2287,   2293,   2363,   2397,
          2519,   2566,   2657,   2847,   2963,   3030,   3383,   3484,
          3721,   3769,   4125,   4788,   4899,   4909,   4913,   5025,
          5032,   5034,   5053,   5065,   5151,   5359,   5409,   5434,
          5565,   5601,   5608,   5612,   5677,   6171,   6180,   6203,
          6334,   6364,   6447,   6456,   6639,   6641,   6745,   6767,
          6872,   7009,   7048,   7083,   7103,   7104,   7421,   7662,
          7668,   7782,   7863,   7872,   7961,   8111,   8142,   8345,
          8373,   8374,   8390,   8419,   8455,   8496,   8575,   8596,
          8688,   8753,   8792,   8862,   8967,   8984,   9119,   9149,
          9160,   9638,   9775,   9863,   9902,   9940,  10063,  10090,
         10093,  10316,  10558,  10564,  10686,  10842,  10910,  10969,
         11016,  11027,  11039,  11108,  11123,  11211,  11340,  11610,
         11642,  11698,  11709,  11734,  11750,  11768,  12101,  12356,
         12621,  12676,  12751,  12812,  12934,  12976,  12993,  13005,
         13010,  13135,  13224,  13278,  13345,  13462,  13566,  13768,
         13820,  13844,  13866,  13888,  13961,  14266,  14487,  14800,
         14918,  15042,  15404,  15442,  15508,  15525,  15725,  15816,
         15848,  15893,  15938,  16336,  16361,  16391,  16423,  16475,
         16578,  16771,  16834,  17099,  17131,  17283,  17496,  17603,
         17639,  17678,  17705,  17764,  18073,  18080,  18109,  18313,
         18377,  18470,  18484,  19001,  19067,  19093,  19141,  19194,
         19206,  19218,  19536,  19643,  19712,  19953,  19975,  19996,
         20260,  20318,  20348,  20432,  20501,  20613,  20641,  20712,
         20789,  20891,  21029,  21037,  21092,  21159,  21338,  21363,
         21503,  21733,  21803,  21913,  22288,  22476,  22674,  22817,
         22946,  23023,  23159,  23444,  23695,  23707,  23810,  23882,
         24021,  24039,  24042,  24194,  24221,  24381,  24400,  24563,
         24614,  24897,  25054,  25148,  25182,  25204,  25458,  25482,
         25496,  25726,  25875,  25901,  25957,  26197,  26759,  26855,
         26877,  26883,  27101,  27120,  27246,  27398,  27501,  27521,
         27524,  27792,  27842,  27867,  27896,  27898,  27901,  28170,
         28357,  28444,  28619,  28687,  28778,  28894,  28984,  29008,
         29047,  29092,  29187,  29263,  29306,  29386,  29518,  29646,
         29734,  30043,  30091,  30095,  30122,  30213,  30420,  30658,
         31065,  31079,  31082,  31396,  31473,  31478,  31519,  31663,
         31698,  31922,  31971,  32165,  32238,  32302,  32341,  32459,
         32592,  32767,  32776,  32951,  33038,  33377,  33418,  33560,
         33735,  33744,  33996,  34037,  34069,  34071,  34252,  34344,
         34388,  34533,  34640,  34717,  34758,  34993,  35056,  35170,
         35214,  35245,  35298,  35427,  35438,  35546,  35757,  35830,
         36162,  36173,  36203,  36345,  36420,  36481,  36598,  36787,
         36822,  36840,  36846,  36992,  36996,  37102,  37154,  37238,
         37500,  37520,  37620,  37669,  37715,  38021,  38082,  38170,
         38371,  38408,  38438,  38457,  38477,  38579,  38954,  38956,
         39064,  39188,  39201,  39327,  39669,  39752,  39807,  39910,
         39957,  39999,  40082,  40106,  40158,  40287,  40297,  40514,
         40583,  40584,  40723,  40811,  41199,  41257,  41374,  41590,
         41613,  41638,  41658,  41887,  41901,  41916,  41966,  42098,
         42249,  42285,  42493,  42557,  42565,  42665,  42673,  42739,
         42796,  43117,  43287,  43362,  43377,  43667,  43707,  43731,
         43758,  43795,  43931,  43981,  43989,  44042,  44064,  44699,
         44740,  44895,  44913,  44986,  45058,  45168,  45274,  45362,
         45364,  45461,  45508,  45566,  45826,  45998,  46319,  46332,
         46366,  46505,  46524,  46698,  46797,  46898,  46968,  47453,
         47494,  47631,  47683,  47693,  47747,  47882,  48119,  48162,
         48170,  48218,  48306,  48325,  48374,  48488,  48530,  48549,
         48560,  48618,  48702,  48714,  48745,  49034,  49084,  49114,
         49180,  49243,  49246,  49309,  49349,  49433,  49434,  49506,
         49575,  49779,  49846,  49910,  49966,  49973,  50174,  50482,
         50647,  50679,  50728,  50730,  50808,  50910,  50998,  51184,
         51299,  51431,  51557,  51655,  51674,  51682,  51858,  51940,
         51941,  51960,  51982,  52153,  52225,  52318,  52333,  52613,
         52692,  52850,  52870,  53000,  53064,  53136,  53193,  53197,
         53279,  53323,  53355,  53410,  53463,  53686,  53713,  54210,
         54381,  54382,  54537,  54594,  54688,  54929,  55316,  55389,
         55501,  55509,  55564,  55652,  55667,  55765,  55888,  56108,
         56256,  56408,  56478,  56620,  56935,  57082,  57131,  57477,
         57593,  57635,  57649,  57671,  57902,  57996,  58103,  58214,
         58230,  58288,  58367,  58693,  58749,  58822,  59031,  59053,
         59132,  59164,  59201,  59308,  59532,  59759,  59871,  59935,
         59991,  60169,  60379,  60465,  60483,  60550,  60666,  60670,
         60732,  60757,  61152,  61168,  61192,  61254,  61530,  61767,
         61905,  61931,  62066,  62108,  62545,  62752,  62939,  63084,
         63200,  63219,  63253,  63290,  63330,  63368,  63682,  63801,
         63874,  63933,  63963,  64035,  64036,  64131,  64164,  64350,
         64598,  64661,  64663,  64701,  64934,  64947,  65118,  65286,
         65330,  65401,  65657,  65661,  66020,  66037,  66263,  66747,
         66769,  66780,  66906,  67009,  67038,  67124,  67192,  67357,
         67379,  67389,  67446,  67494,  67612,  67712,  67843,  67892,
         67927,  67940,  68148,  68172,  68266,  68286,  68349,  68414,
         68601,  68607,  68653,  68747,  68803,  68865,  68903,  68909,
         69207,  69236,  69370,  69394,  69411,  69502,  69619,  69720,
         69833,  69885,  69956,  70023,  70056,  70296,  70465,  70589,
         70641,  70686,  71028,  71113,  71180,  71200,  71203,  71587,
         71649,  71676,  72106,  72159,  72200,  72261,  72271,  72313,
         72537,  72557,  72853,  73155,  73275,  73504,  73812,  74128,
         74181,  74430,  74627,  74801,  74824,  74909,  75083,  75210,
         75801,  75828,  75842,  75991,  76442,  76644,  76776,  76971,
         76986,  77154,  77396,  77775,  77800,  77803,  77853,  78191,
         78270,  78439,  78594,  78611,  78618,  78696,  78911,  78928,
         79016,  79090,  79175,  79313,  79438,  79541,  79649,  79783,
         80150,  80173,  80491,  80647,  80718,  80725,  80860,  80869,
         80891,  80944,  80958,  81008,  81015,  81020,  81111,  81271,
         81313,  81386,  81419,  81586,  81622,  81670,  81791,  81807,
         81988,  82060,  82092,  82336,  82516,  82523,  82526,  82541,
         82554,  82564,  82630,  82636,  82687,  82750,  82918,  82966,
         83507,  84850,  84952,  84964,  85136,  85222,  85231,  85388,
         85446,  85529,  85850,  85883,  85952,  86032,  86096,  86169,
         86188,  86352,  86472,  86712,  86713,  86823,  86938,  87129,
         87309,  87413,  87529,  87723,  88053,  88087,  88090,  88157,
         88161,  88258,  88265,  88273,  88572,  88754,  88780,  88821,
         88915,  88965,  88988,  88998,  89040,  89062,  89245,  89279,
         89633,  89647,  89827,  89941,  90003,  90034,  90256,  90414,
         90583,  90587,  90885,  91190,  91352,  91410,  91502,  91599,
         91685,  91707,  91711,  91729,  91894,  92030,  92081,  92277,
         92302,  92386,  92476,  92516,  92755,  92879,  92959,  93072,
         93115,  93121,  93276,  93646,  93790,  93837,  93852,  94023,
         94026,  94229,  94267,  94491,  94588,  94771,  94899,  94916,
         95014,  95079,  95093,  95100,  95220,  95242,  95314,  95328,
         95410,  95608,  95668,  95683,  95776,  95853,  95891,  95978,
         96222,  96225,  96242,  96299,  96349,  96417,  96508,  96613,
         96895,  96917,  96947,  96979,  97178,  97287,  97324,  97341,
         97457,  97531,  97542,  97575,  97682,  97719,  97769,  97836,
         97976,  98127,  98328,  98417,  98436,  98446,  98518,  98674,
         98721,  98857,  98886,  98910,  99168,  99190,  99206,  99320,
         99362,  99703,  99845,  99883,  99902,  99904,  99932,  99956,
         99978, 100203, 100330, 100416, 100450, 100589, 100685, 100735,
        100755, 100761, 100803, 100909, 100916, 101335, 101993, 102036,
        102204, 102233, 102292, 102345, 102459, 102661, 102672, 102776,
        102835, 102841, 102862, 102989, 102993, 103183, 103375, 103732,
        103798, 103876, 104273, 104297, 104317, 104324, 104685, 105143,
        105145, 105193, 105404, 105481, 105886, 105975, 105979, 106099])}
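
Because the full entry is long, a compact summary is often easier to read. The helper below is a sketch (summarize_handle is a hypothetical name, not a pyInfinityFlow function) that relies only on the dictionary keys shown in the output above. The demo_handle dictionary is a stand-in with made-up index values so that the snippet runs on its own.

```python
def summarize_handle(handle):
    """Return channel information and event counts for one InfinityMarker entry."""
    return {
        "name": handle["name"],
        "prediction_channel": handle["prediction_channel"],
        "n_backbone_channels": len(handle["backbone_channels"]),
        "n_train": len(handle["train_indices"]),
        "n_test": len(handle["test_indices"]),
        "n_pool": len(handle["pool_indices"]),
    }

# Stand-in entry with made-up indices, mirroring the keys shown above
demo_handle = {
    "name": "33D1",
    "prediction_channel": "FJComp-PE(yg)-A",
    "backbone_channels": ["FJComp-APC-A", "FJComp-BUV395-A"],
    "train_indices": [0, 1, 2],
    "test_indices": [3, 4, 5],
    "pool_indices": [3, 5],
}
summary = summarize_handle(demo_handle)
```

In the notebook, summarize_handle(file_handler.handles["33D1"]) gives the same overview for the real entry.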

Step 3: Specify Output Directories

Here, we simply need to specify a directory in which to save the outputs of the pipeline. The InfinityFlow_Utilities.setup_output_directories function will prepare a dictionary that stores where to save different outputs, and create those directories:

[5]:
output_paths = InfinityFlow_Utilities.setup_output_directories(
    output_dir=my_output_dir,
    file_handler=file_handler,
    verbosity=1)

output_paths
[5]:
{'output_regression_path': '/media/kyle_ssd1/outputs/regression_results',
 'output_umap_feature_plot_path': '/media/kyle_ssd1/outputs/umap_feature_plots',
 'clustering': '/media/kyle_ssd1/outputs/clustering',
 'qc': '/media/kyle_ssd1/outputs/QC',
 'output_umap_bc_feature_plot_path': '/media/kyle_ssd1/outputs/umap_feature_plots_background_corrected'}
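
If you want to confirm that the directories were actually created, a check along the following lines can be used. This is a sketch; missing_output_directories is a hypothetical helper, and it assumes only that output_paths maps names to directory paths, as in the output above.

```python
import os

def missing_output_directories(output_paths):
    """Return the sorted keys of output_paths whose directory does not exist."""
    return sorted(
        key for key, path in output_paths.items() if not os.path.isdir(path)
    )
```

Calling missing_output_directories(output_paths) returns an empty list when every directory is in place.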

Step 4: Fitting the XGBoost Regression Models

The InfinityFlow_Utilities.single_chunk_training function is used to create and fit the XGBoost models. It returns a tuple consisting of an InfinityFlow_Utilities.CombinedRegressionModels object and a dictionary that records how long it took to fit the model for each InfinityMarker.

[6]:
regression_models, timings_1 = InfinityFlow_Utilities.single_chunk_training(
    file_handler=file_handler,
    cores_to_use=12,
    use_logicle_scaling=True,
    normalization_method=None,
    verbosity=3)
Reading in data from .fcs files for model training...
DEBUG:          Reading in the data for InfinityMarker 33D1...
DEBUG:          Reading in the data for InfinityMarker Allergin-1...
DEBUG:          Reading in the data for InfinityMarker B7-H4...
DEBUG:          Reading in the data for InfinityMarker CD1d...
DEBUG:          Reading in the data for InfinityMarker CD103...
DEBUG:          Reading in the data for InfinityMarker CD105...
DEBUG:          Reading in the data for InfinityMarker CD106...
DEBUG:          Reading in the data for InfinityMarker CD107a (Lamp-1)...
DEBUG:          Reading in the data for InfinityMarker CD107b (Mac-3)...
DEBUG:          Reading in the data for InfinityMarker CD115...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG2b...
DEBUG:          Reading in the data for InfinityMarker Isotype_mIgG1...
DEBUG:          Reading in the data for InfinityMarker Isotype_AHIgG...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG2a...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG1...
Applying Logicle normalization to data...
        Building regression model for 33D1...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 4.51 seconds.

        Building regression model for Allergin-1...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 4.49 seconds.

        Building regression model for B7-H4...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.65 seconds.

        Building regression model for CD1d...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 6.50 seconds.

        Building regression model for CD103...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 5.05 seconds.

        Building regression model for CD105...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.90 seconds.

        Building regression model for CD106...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.28 seconds.

        Building regression model for CD107a (Lamp-1)...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 6.44 seconds.

        Building regression model for CD107b (Mac-3)...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 4.62 seconds.

        Building regression model for CD115...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.80 seconds.

        Building regression model for Isotype_rIgG2b...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.79 seconds.

        Building regression model for Isotype_mIgG1...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.76 seconds.

        Building regression model for Isotype_AHIgG...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.75 seconds.

        Building regression model for Isotype_rIgG2a...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.71 seconds.

        Building regression model for Isotype_rIgG1...
DEBUG: Setting n_jobs to 12 and random_state to None
DEBUG:          XGBoost regression model trained in 3.46 seconds.

[7]:
regression_models
[7]:
CombinedRegressionModels Object from pyInfinityFlow
        Contains regression models for the following InfinityMarkers (Response Variables):
33D1,Allergin-1,B7-H4,CD1d,CD103,CD105,CD106,CD107a (Lamp-1),CD107b (Mac-3),CD115,Isotype_rIgG2b,Isotype_mIgG1,Isotype_AHIgG,Isotype_rIgG2a,Isotype_rIgG1

        Uses the following backbone (Explanatory Variables):
FJComp-APC-A,FJComp-AlexaFluor700-A,FJComp-BUV395-A,FJComp-BUV737-A,FJComp-BV421-A,FJComp-BV510-A,FJComp-BV605-A,FJComp-BV650-A,FJComp-BV711-A,FJComp-BV786-A,FJComp-GFP-A,FJComp-PE-Cy7(yg)-A,FJComp-PerCP-Cy5-5-A

The object holds the following variables:
        ordered_training_channels
        var_annotations
        infinity_markers
        regression_models
        parameter_annotations
        infinity_channels
        validation_metrics

        Access regression models as dictionary with the InfinityMarker as the key:
                Eg. CombinedRegressionModels.regression_models["33D1"]

Step 5: Validating Regression Models

Next, we can use held-out data from each of the InfinityMarker FCS files to score how well each model imputes the InfinityMarker expression values from the Backbone features. This is done with the InfinityFlow_Utilities.single_chunk_testing function, which returns a tuple with an updated CombinedRegressionModels object that contains validation metrics, and a dictionary that tracks the timing of the validation.
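
The two scores reported in this step are named r2_score and mean_squared_error. As a reminder of what these quantities measure, the sketch below computes them in plain Python using the standard definitions. That pyInfinityFlow uses exactly these definitions is an assumption based on the metric names.

```python
def mean_squared_error(true, pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)

def r2_score(true, pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot

true_values = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(true_values, true_values)   # predictions equal the truth
baseline = r2_score(true_values, [2.5] * 4)    # always predicting the mean
```

An r2_score close to 1 means the Backbone explains most of the variance in the marker, while a value near 0 means the model does little better than predicting the mean.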

[8]:
regression_models, timings_2 = InfinityFlow_Utilities.single_chunk_testing(
    file_handler=file_handler,
    regression_models=regression_models,
    use_logicle_scaling=True,
    normalization_method=None,
    verbosity=3)
Reading in data from .fcs files for model validation...
DEBUG:          Reading in the data for InfinityMarker 33D1...
DEBUG:          Reading in the data for InfinityMarker Allergin-1...
DEBUG:          Reading in the data for InfinityMarker B7-H4...
DEBUG:          Reading in the data for InfinityMarker CD1d...
DEBUG:          Reading in the data for InfinityMarker CD103...
DEBUG:          Reading in the data for InfinityMarker CD105...
DEBUG:          Reading in the data for InfinityMarker CD106...
DEBUG:          Reading in the data for InfinityMarker CD107a (Lamp-1)...
DEBUG:          Reading in the data for InfinityMarker CD107b (Mac-3)...
DEBUG:          Reading in the data for InfinityMarker CD115...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG2b...
DEBUG:          Reading in the data for InfinityMarker Isotype_mIgG1...
DEBUG:          Reading in the data for InfinityMarker Isotype_AHIgG...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG2a...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG1...
Applying Logicle normalization to data...
Obtaining validation metrics for regression models...
                Working on 33D1...
                Working on Allergin-1...
                Working on B7-H4...
                Working on CD1d...
                Working on CD103...
                Working on CD105...
                Working on CD106...
                Working on CD107a (Lamp-1)...
                Working on CD107b (Mac-3)...
                Working on CD115...
                Working on Isotype_rIgG2b...
                Working on Isotype_mIgG1...
                Working on Isotype_AHIgG...
                Working on Isotype_rIgG2a...
                Working on Isotype_rIgG1...

The single_chunk_testing function assigns a dictionary to the validation_metrics attribute of the CombinedRegressionModels object. Each InfinityMarker name is a key whose value is a dictionary holding the predicted values, true values, r2_score, and mean_squared_error:

[9]:
regression_models.validation_metrics
[9]:
{'33D1': {'pred': array([0.25200623, 0.2564635 , 0.25081083, ..., 0.25893834, 0.23685327,
         0.25687218], dtype=float32),
  'true': array([0.25880677, 0.24245095, 0.25493586, ..., 0.274207  , 0.21523006,
         0.24371576], dtype=float32),
  'r2_score': 0.19209259889025,
  'mean_squared_error': 0.0001764585},
 'Allergin-1': {'pred': array([0.21301892, 0.21872343, 0.20963453, ..., 0.21750394, 0.22308473,
         0.21142632], dtype=float32),
  'true': array([0.26102197, 0.2454918 , 0.24498476, ..., 0.21617137, 0.22679122,
         0.23546468], dtype=float32),
  'r2_score': 0.5829876350462362,
  'mean_squared_error': 0.00061123044},
 'B7-H4': {'pred': array([0.22862323, 0.2501482 , 0.25119272, ..., 0.2621282 , 0.2515326 ,
         0.24431813], dtype=float32),
  'true': array([0.2212211 , 0.22892201, 0.24515927, ..., 0.25912946, 0.23800558,
         0.23579723], dtype=float32),
  'r2_score': 0.17660769306807023,
  'mean_squared_error': 0.00022054944},
 'CD1d': {'pred': array([0.27452454, 0.27784905, 0.3807743 , ..., 0.26960534, 0.27109462,
         0.26869002], dtype=float32),
  'true': array([0.29084668, 0.2787824 , 0.37280142, ..., 0.2871778 , 0.2708428 ,
         0.25334674], dtype=float32),
  'r2_score': 0.6454607421084098,
  'mean_squared_error': 0.0010725937},
 'CD103': {'pred': array([0.25955388, 0.25041333, 0.2822778 , ..., 0.2555274 , 0.23480846,
         0.23861615], dtype=float32),
  'true': array([0.26624402, 0.21703161, 0.25556582, ..., 0.24584499, 0.23774952,
         0.26314855], dtype=float32),
  'r2_score': 0.6333754867688306,
  'mean_squared_error': 0.00029186436},
 'CD105': {'pred': array([0.25279748, 0.26322943, 0.33079505, ..., 0.2977552 , 0.24908838,
         0.23428737], dtype=float32),
  'true': array([0.25490686, 0.27640045, 0.3742214 , ..., 0.2684391 , 0.24985689,
         0.21935181], dtype=float32),
  'r2_score': 0.6753359632935965,
  'mean_squared_error': 0.0004727315},
 'CD106': {'pred': array([0.24811669, 0.24805331, 0.25124604, ..., 0.26913488, 0.24582317,
         0.24637443], dtype=float32),
  'true': array([0.25405908, 0.2475744 , 0.24266441, ..., 0.25619355, 0.2635005 ,
         0.24447615], dtype=float32),
  'r2_score': 0.3221144778474361,
  'mean_squared_error': 0.00039838598},
 'CD107a (Lamp-1)': {'pred': array([0.25488272, 0.2471615 , 0.5623303 , ..., 0.25523907, 0.25942624,
         0.25205564], dtype=float32),
  'true': array([0.2673547 , 0.25526446, 0.58498925, ..., 0.23471874, 0.26126346,
         0.24185239], dtype=float32),
  'r2_score': 0.8204075049850199,
  'mean_squared_error': 0.0018579975},
 'CD107b (Mac-3)': {'pred': array([0.26830757, 0.28503788, 0.2507204 , ..., 0.24908292, 0.25389788,
         0.24230753], dtype=float32),
  'true': array([0.29518053, 0.28111485, 0.255299  , ..., 0.2436143 , 0.2626211 ,
         0.2382466 ], dtype=float32),
  'r2_score': 0.7183401629602935,
  'mean_squared_error': 0.0014851231},
 'CD115': {'pred': array([0.2469954 , 0.25325543, 0.24312335, ..., 0.2437099 , 0.26318344,
         0.2492683 ], dtype=float32),
  'true': array([0.24417807, 0.25671375, 0.2558331 , ..., 0.23946778, 0.26340008,
         0.25041574], dtype=float32),
  'r2_score': 0.2159727711077275,
  'mean_squared_error': 0.00015180971},
 'Isotype_rIgG2b': {'pred': array([0.3152353 , 0.30198446, 0.24886578, ..., 0.24904986, 0.24726558,
         0.24757081], dtype=float32),
  'true': array([0.31431442, 0.34223923, 0.32557005, ..., 0.23365542, 0.26259127,
         0.24768159], dtype=float32),
  'r2_score': 0.2243016223785964,
  'mean_squared_error': 0.00011540718},
 'Isotype_mIgG1': {'pred': array([0.28719217, 0.2716357 , 0.26335174, ..., 0.27846232, 0.2770714 ,
         0.2506193 ], dtype=float32),
  'true': array([0.29004258, 0.31199   , 0.3189279 , ..., 0.30371755, 0.311677  ,
         0.29318032], dtype=float32),
  'r2_score': 0.2049955296927476,
  'mean_squared_error': 0.00071228656},
 'Isotype_AHIgG': {'pred': array([0.24830323, 0.2492735 , 0.25518328, ..., 0.24604471, 0.24749778,
         0.2534635 ], dtype=float32),
  'true': array([0.29159304, 0.26729953, 0.27030438, ..., 0.22781612, 0.24713887,
         0.24599802], dtype=float32),
  'r2_score': 0.21989004514011257,
  'mean_squared_error': 0.00014965146},
 'Isotype_rIgG2a': {'pred': array([0.25454333, 0.24098672, 0.2699412 , ..., 0.25125703, 0.24956013,
         0.24680562], dtype=float32),
  'true': array([0.2568882 , 0.281961  , 0.33056295, ..., 0.25644702, 0.24867922,
         0.24952465], dtype=float32),
  'r2_score': 0.17117436281629972,
  'mean_squared_error': 0.00016250834},
 'Isotype_rIgG1': {'pred': array([0.2509288 , 0.24883913, 0.25389326, ..., 0.24579412, 0.24850413,
         0.24942122], dtype=float32),
  'true': array([0.24504311, 0.24437752, 0.25067183, ..., 0.22886375, 0.24671295,
         0.23623529], dtype=float32),
  'r2_score': 0.21265289462313386,
  'mean_squared_error': 0.00014281561}}
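
As a rough guide to reading this output, the two summary values in each entry can be recomputed from the 'pred' and 'true' arrays. The sketch below assumes the usual definitions of the coefficient of determination and the mean squared error; the helper name r2_and_mse and the toy arrays are hypothetical, while the r2_score values in the ranking are copied from the output above:

```python
def r2_and_mse(true, pred):
    """Return (R^2, MSE) for two equal-length sequences of numbers."""
    n = len(true)
    mean_true = sum(true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in true)        # total sum of squares
    return 1.0 - ss_res / ss_tot, ss_res / n

# Toy arrays (hypothetical values, for illustration only)
r2, mse = r2_and_mse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])

# r2_score values copied from the validation output above
reported_r2 = {
    "CD105": 0.6753359632935965,
    "CD106": 0.3221144778474361,
    "CD107a (Lamp-1)": 0.8204075049850199,
    "CD107b (Mac-3)": 0.7183401629602935,
    "CD115": 0.2159727711077275,
    "Isotype_rIgG2a": 0.17117436281629972,
}

# Rank the markers from best to worst validation R^2
ranked = sorted(reported_r2, key=reported_r2.get, reverse=True)
print(ranked)
```

Ranking the markers this way gives a quick view of which InfinityMarkers the Backbone predicts well and which it does not.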

Step 6: Predict InfinityMarker Values for Final InfinityFlow Object

The InfinityFlow_Utilities.make_flow_regression_predictions function carries out the imputation on the reference FCS dataset to predict the InfinityMarker expression values. It returns a tuple containing the resulting anndata.AnnData object and a dictionary that records the timing of the prediction steps:

[10]:
sub_p_adata, timings_3 = InfinityFlow_Utilities.make_flow_regression_predictions(\
    file_handler=file_handler,
    regression_models=regression_models,
    use_logicle_scaling=True,
    normalization_method=None,
    verbosity=3)
Reading in data from .fcs files for pooling into final InfinityFlow object...
DEBUG:          Reading in the data for InfinityMarker 33D1...
DEBUG:          Reading in the data for InfinityMarker Allergin-1...
DEBUG:          Reading in the data for InfinityMarker B7-H4...
DEBUG:          Reading in the data for InfinityMarker CD1d...
DEBUG:          Reading in the data for InfinityMarker CD103...
DEBUG:          Reading in the data for InfinityMarker CD105...
DEBUG:          Reading in the data for InfinityMarker CD106...
DEBUG:          Reading in the data for InfinityMarker CD107a (Lamp-1)...
DEBUG:          Reading in the data for InfinityMarker CD107b (Mac-3)...
DEBUG:          Reading in the data for InfinityMarker CD115...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG2b...
DEBUG:          Reading in the data for InfinityMarker Isotype_mIgG1...
DEBUG:          Reading in the data for InfinityMarker Isotype_AHIgG...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG2a...
DEBUG:          Reading in the data for InfinityMarker Isotype_rIgG1...
Applying Logicle normalization to data...
Making predictions for final InfinityFlow object...
                Working on 33D1...
                Working on Allergin-1...
                Working on B7-H4...
                Working on CD1d...
                Working on CD103...
                Working on CD105...
                Working on CD106...
                Working on CD107a (Lamp-1)...
                Working on CD107b (Mac-3)...
                Working on CD115...
                Working on Isotype_rIgG2b...
                Working on Isotype_mIgG1...
                Working on Isotype_AHIgG...
                Working on Isotype_rIgG2a...
                Working on Isotype_rIgG1...
/media/kyle_storage/kyle_ferchen/Python/Env/pyInfinityFlow_dev/lib/python3.8/site-packages/pyInfinityFlow/InfinityFlow_Utilities.py:1469: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  predicted_var.loc[:,"IMPUTED"] = True
/media/kyle_storage/kyle_ferchen/Python/Env/pyInfinityFlow_dev/lib/python3.8/site-packages/pyInfinityFlow/InfinityFlow_Utilities.py:1482: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead.
  var = pd.concat([raw_sub_p_adata.var, predicted_var]),

The resulting AnnData object (sub_p_adata) can now be used for downstream analysis steps!

Step 7: Isotype Background Correction

We can now carry out Isotype background correction using the InfinityFlow_Utilities.perform_background_correction function, which will return a tuple with 3 values:

  1. A pandas.DataFrame of the background corrected data

  2. The .var annotation to specify settings for the features

  3. Timings dictionary to track how much time was used in the function

[11]:
background_corrected_data, background_corrected_var, timings_4 = \
    InfinityFlow_Utilities.perform_background_correction(\
        sub_p_adata = sub_p_adata,
        infinity_marker_annotation = infinitymarker_anno,
        file_handler = file_handler,
        cores_to_use = 12,
        verbosity = 3)
DEBUG: Feature 33D1 will use isotype Isotype_rIgG2b...
DEBUG: Feature Allergin-1 will use isotype Isotype_mIgG1...
DEBUG: Feature B7-H4 will use isotype Isotype_AHIgG...
DEBUG: Feature CD1d will use isotype Isotype_rIgG2b...
DEBUG: Feature CD103 will use isotype Isotype_AHIgG...
DEBUG: Feature CD105 will use isotype Isotype_rIgG2a...
DEBUG: Feature CD106 will use isotype Isotype_rIgG2a...
DEBUG: Feature CD107a (Lamp-1) will use isotype Isotype_rIgG2a...
DEBUG: Feature CD107b (Mac-3) will use isotype Isotype_rIgG1...
DEBUG: Feature CD115 will use isotype Isotype_rIgG2a...
[12]:
background_corrected_data.head()
[12]:
33D1 Allergin-1 B7-H4 CD1d CD103 CD105 CD106 CD107a (Lamp-1) CD107b (Mac-3) CD115
0 0.053235 0.083288 0.062567 0.073332 0.050480 0.143983 0.087025 0.060597 0.050773 0.050279
1 0.059184 0.080860 0.043288 0.046708 0.030332 0.049352 0.079781 0.069040 0.047640 0.042443
2 0.059397 0.085905 0.048852 0.151401 0.021355 0.056680 0.089235 0.135968 0.109725 0.052312
3 0.050306 0.078846 0.060089 0.123346 0.041751 0.036221 0.080687 0.058094 0.036672 0.044819
4 0.060201 0.072160 0.055046 0.054883 0.049996 0.056747 0.083932 0.073318 0.054824 0.051910
[13]:
background_corrected_var.head()
[13]:
name USE_LOGICLE LOGICLE_T LOGICLE_W LOGICLE_M LOGICLE_A LOGICLE_APPLIED IMPUTED
33D1 InfinityMarker_33D1 True 3000000.0 0.0 3.0 1.0 True True
Allergin-1 InfinityMarker_Allergin-1 True 3000000.0 0.0 3.0 1.0 True True
B7-H4 InfinityMarker_B7-H4 True 3000000.0 0.0 3.0 1.0 True True
CD1d InfinityMarker_CD1d True 3000000.0 0.0 3.0 1.0 True True
CD103 InfinityMarker_CD103 True 3000000.0 0.0 3.0 1.0 True True
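
To give a feel for what a background correction against an isotype control can look like, here is a minimal sketch that regresses a marker on its isotype with ordinary least squares and keeps the residuals. This is an assumption made for illustration, not a description of what perform_background_correction does internally; the library's procedure may differ (the corrected values shown above are all positive, for instance, while plain residuals are centred on zero). The helper name ols_residuals and the toy values are hypothetical:

```python
def ols_residuals(marker, isotype):
    """Residuals of a simple least-squares fit: marker ~ intercept + slope * isotype."""
    n = len(marker)
    mean_x = sum(isotype) / n
    mean_y = sum(marker) / n
    sxx = sum((x - mean_x) ** 2 for x in isotype)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(isotype, marker))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Whatever the isotype signal cannot explain is kept as the corrected value
    return [y - (intercept + slope * x) for x, y in zip(isotype, marker)]

# Toy predicted values for four cells (hypothetical)
isotype = [0.24, 0.25, 0.26, 0.27]
marker = [0.25, 0.27, 0.26, 0.40]
residuals = ols_residuals(marker, isotype)

# A marker that is a pure linear function of the isotype leaves no signal behind
linear_marker = [2.0 * x + 1.0 for x in isotype]
linear_residuals = ols_residuals(linear_marker, isotype)
```

The point of the sketch is only that signal explained by the matched isotype control is treated as background, which is why each feature is paired with an isotype in the DEBUG output above.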

Step 8: Silencing Features

There are some channels that we may want to exclude from downstream analyses if they are not relevant to cell state (e.g. the ‘Time’ parameter). The InfinityFlow_Utilities.move_features_to_silent function takes the given features_to_silence out of the AnnData.X array and moves them into the AnnData.obsm[‘silent’] attribute.

For example, we can first list the features present in the InfinityFlow AnnData object:

[14]:
sub_p_adata.var.index.values
[14]:
array(['FSC-A', 'FSC-H', 'FSC-W', 'SSC-A', 'SSC-H', 'SSC-W',
       'FJComp-APC-A', 'FJComp-APC-eFlour780-A', 'FJComp-AlexaFluor700-A',
       'FJComp-BUV395-A', 'FJComp-BUV737-A', 'FJComp-BV421-A',
       'FJComp-BV510-A', 'FJComp-BV605-A', 'FJComp-BV650-A',
       'FJComp-BV711-A', 'FJComp-BV786-A', 'FJComp-GFP-A',
       'FJComp-PE(yg)-A', 'FJComp-PE-Cy7(yg)-A', 'FJComp-PerCP-Cy5-5-A',
       'Time', '33D1', 'Allergin-1', 'B7-H4', 'CD1d', 'CD103', 'CD105',
       'CD106', 'CD107a (Lamp-1)', 'CD107b (Mac-3)', 'CD115',
       'Isotype_rIgG2b', 'Isotype_mIgG1', 'Isotype_AHIgG',
       'Isotype_rIgG2a', 'Isotype_rIgG1'], dtype=object)

Let’s move some of the features to silent, so they are not considered for dimensionality reduction or clustering:

[15]:
features_to_silence = ['FSC-A', 'FSC-H', 'FSC-W', 'SSC-A', 'SSC-H', 'SSC-W',
    'FJComp-PE(yg)-A', 'Isotype_rIgG2b', 'Isotype_mIgG1', 'Isotype_AHIgG',
    'Isotype_rIgG2a', 'Isotype_rIgG1', 'Time']

sub_p_adata = InfinityFlow_Utilities.move_features_to_silent(sub_p_adata, features_to_silence)
sub_p_adata
[15]:
AnnData object with n_obs × n_vars = 15000 × 24
    obs: 'cell_number', 'batch'
    var: 'name', 'USE_LOGICLE', 'LOGICLE_T', 'LOGICLE_W', 'LOGICLE_M', 'LOGICLE_A', 'LOGICLE_APPLIED', 'IMPUTED'
    uns: 'obs_file_origin', 'silent_var'
    obsm: 'silent'

As you can see, the AnnData object now contains an obsm key ‘silent’ to store the event values for the silenced features, as well as a ‘silent_var’ pandas.DataFrame in the AnnData.uns attribute.

[16]:
sub_p_adata.obsm['silent'].head()
[16]:
FSC-A FSC-H FSC-W SSC-A SSC-H SSC-W FJComp-PE(yg)-A Isotype_rIgG2b Isotype_mIgG1 Isotype_AHIgG Isotype_rIgG2a Isotype_rIgG1 Time
F0:5 44290.410156 48154.0 60277.785156 3829.130127 3626.0 0.590961 0.258807 0.252803 0.247846 0.250860 0.246098 0.256654 0.250912
F0:113 33078.601562 28222.0 76813.804688 9273.810547 7514.0 0.607835 0.245435 0.235438 0.238048 0.257867 0.249520 0.250518 0.251031
F0:137 141369.265625 105760.0 87601.898438 26223.259766 22176.0 0.603203 0.195596 0.230424 0.277973 0.244754 0.227513 0.227703 0.251047
F0:375 86083.023438 62675.0 90012.554688 9570.060547 7683.0 0.608832 0.238332 0.248544 0.235057 0.245086 0.249628 0.247355 0.251229
F0:430 126470.789062 99065.0 83666.179688 7711.190430 6665.0 0.600839 0.213695 0.249908 0.270671 0.260307 0.245320 0.248516 0.251268
[17]:
sub_p_adata.uns['silent_var'].head()
[17]:
name USE_LOGICLE LOGICLE_T LOGICLE_W LOGICLE_M LOGICLE_A LOGICLE_APPLIED IMPUTED
FSC-A False 3000000.0 0.0 3.0 1.0 False False
FSC-H False 3000000.0 0.0 3.0 1.0 False False
FSC-W False 3000000.0 0.0 3.0 1.0 False False
SSC-A False 3000000.0 0.0 3.0 1.0 False False
SSC-H False 3000000.0 0.0 3.0 1.0 False False
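
The bookkeeping behind silencing can be checked with plain Python lists. The sketch below uses the feature names listed in this step; it only tracks names, not the underlying data, and is an illustration rather than the library's implementation:

```python
all_features = [
    'FSC-A', 'FSC-H', 'FSC-W', 'SSC-A', 'SSC-H', 'SSC-W',
    'FJComp-APC-A', 'FJComp-APC-eFlour780-A', 'FJComp-AlexaFluor700-A',
    'FJComp-BUV395-A', 'FJComp-BUV737-A', 'FJComp-BV421-A',
    'FJComp-BV510-A', 'FJComp-BV605-A', 'FJComp-BV650-A',
    'FJComp-BV711-A', 'FJComp-BV786-A', 'FJComp-GFP-A',
    'FJComp-PE(yg)-A', 'FJComp-PE-Cy7(yg)-A', 'FJComp-PerCP-Cy5-5-A',
    'Time', '33D1', 'Allergin-1', 'B7-H4', 'CD1d', 'CD103', 'CD105',
    'CD106', 'CD107a (Lamp-1)', 'CD107b (Mac-3)', 'CD115',
    'Isotype_rIgG2b', 'Isotype_mIgG1', 'Isotype_AHIgG',
    'Isotype_rIgG2a', 'Isotype_rIgG1']

features_to_silence = [
    'FSC-A', 'FSC-H', 'FSC-W', 'SSC-A', 'SSC-H', 'SSC-W',
    'FJComp-PE(yg)-A', 'Isotype_rIgG2b', 'Isotype_mIgG1', 'Isotype_AHIgG',
    'Isotype_rIgG2a', 'Isotype_rIgG1', 'Time']

# Split the feature names into the active set and the silenced set
silent_set = set(features_to_silence)
active = [f for f in all_features if f not in silent_set]
silent = [f for f in all_features if f in silent_set]

print(len(all_features), len(silent), len(active))
```

Of the 37 original features, 13 are silenced, which leaves the 24 variables reported in the AnnData summary above.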

Step 9: Dimensionality Reduction

Now that the InfinityFlow results are in an AnnData object, we can use the tools provided by Scanpy to perform downstream analysis.

PCA

If there are a lot of features in the dataset, it may be beneficial to use Principal component analysis to reduce the feature space to a smaller set that captures most of the variation observed.

We can apply the scanpy.tl.pca function to carry this out on our InfinityFlow AnnData object. InfinityFlow_Utilities.make_pca_elbo_plot can then be used to generate an elbow plot, which helps us estimate how few principal components we need to capture most of the variation in the dataset:

[18]:
import scanpy as sc
sc.tl.pca(sub_p_adata)
# It is useful to save the features that were used
# at the time the PCA function was called, as the
# silenced features may change when the object is
# reloaded.
sub_p_adata.uns['pca_features'] = sub_p_adata.var.index.values

# Make the elbow plot:
InfinityFlow_Utilities.make_pca_elbo_plot(\
    sub_p_adata=sub_p_adata,
    output_paths=output_paths)
../_images/notebooks_pyInfinityFlow_API_Tutorial_35_0.png

Here, we can see that the first 15 principal components capture most of the explained variance in the data, so we will use 15 PCs for the downstream analysis steps.
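
If you would rather pick the number of PCs numerically than by eye, one simple option is a cumulative explained-variance threshold. The sketch below is a hypothetical helper (n_components_for is not part of pyInfinityFlow) and the variance ratios are made-up values; with scanpy, the real ratios are stored in sub_p_adata.uns['pca']['variance_ratio'] after sc.tl.pca has run:

```python
from itertools import accumulate

def n_components_for(variance_ratio, threshold):
    """Smallest number of leading PCs whose cumulative variance ratio reaches threshold."""
    for n, cumulative in enumerate(accumulate(variance_ratio), start=1):
        if cumulative >= threshold:
            return n
    # Threshold never reached: fall back to all computed components
    return len(variance_ratio)

# Made-up explained-variance ratios for six components (illustration only)
example_ratios = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
n_pcs = n_components_for(example_ratios, threshold=0.85)
print(n_pcs)
```

The threshold is a judgment call in the same way the elbow is, so it is worth comparing the number it suggests with the plot.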

Note that the make_pca_elbo_plot function saves this PC elbow plot to the ‘QC’ directory of the output_dir that we specified in the InfinityFlow_Utilities.setup_output_directories function, which produced the output_paths dictionary. We can check where the ‘QC’ folder is on our machine with the following:

[19]:
# List the keys available in output_paths
output_paths.keys()
[19]:
dict_keys(['output_regression_path', 'output_umap_feature_plot_path', 'clustering', 'qc', 'output_umap_bc_feature_plot_path'])
[20]:
# Print out the 'qc' directory path
output_paths['qc']
[20]:
'/media/kyle_ssd1/outputs/QC'

UMAP

UMAP is a very popular method for dimensionality reduction, particularly for reducing the feature space to two dimensions so that the data can be viewed as a scatter plot. With this, we can observe the global structure of the data to get an idea of what groups of observations exist. In the context of flow cytometry, we can also cluster the data to identify cell types based on surface marker phenotypes.

To carry out the 2D UMAP dimensionality reduction, we can again use scanpy. First, we need to generate an estimate of the adjacency matrix using the scanpy.pp.neighbors function, which will help the UMAP function decide where to place the observations in the reduced dimensional space. We will tell the function to use the first 15 PCs:

[21]:
sc.pp.neighbors(sub_p_adata, n_pcs=15)
sub_p_adata
[21]:
AnnData object with n_obs × n_vars = 15000 × 24
    obs: 'cell_number', 'batch'
    var: 'name', 'USE_LOGICLE', 'LOGICLE_T', 'LOGICLE_W', 'LOGICLE_M', 'LOGICLE_A', 'LOGICLE_APPLIED', 'IMPUTED'
    uns: 'obs_file_origin', 'silent_var', 'pca', 'pca_features', 'neighbors'
    obsm: 'silent', 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

As you can see, this added the ‘neighbors’ key to the sub_p_adata.uns attribute, as well as the ‘distances’ and ‘connectivities’ to the sub_p_adata.obsp attribute.
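
Conceptually, the ‘distances’ and ‘connectivities’ entries describe a nearest-neighbour graph built in PC space. The brute-force sketch below shows the basic idea on a handful of made-up 2D points; scanpy's implementation is more elaborate (it scales to large datasets and converts distances into weighted connectivities), so treat the helper knn_graph as a hypothetical illustration only:

```python
import math

def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest neighbours (brute force)."""
    neighbours = []
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()
        neighbours.append([j for _, j in others[:k]])
    return neighbours

# Made-up coordinates standing in for cells in PC space
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
graph = knn_graph(points, k=1)
print(graph)
```

Both UMAP and Leiden clustering later work from this kind of graph rather than from the raw feature values.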

We can then call the scanpy.tl.umap function to generate the low dimensional embedding:

[22]:
sc.tl.umap(sub_p_adata)
sub_p_adata
[22]:
AnnData object with n_obs × n_vars = 15000 × 24
    obs: 'cell_number', 'batch'
    var: 'name', 'USE_LOGICLE', 'LOGICLE_T', 'LOGICLE_W', 'LOGICLE_M', 'LOGICLE_A', 'LOGICLE_APPLIED', 'IMPUTED'
    uns: 'obs_file_origin', 'silent_var', 'pca', 'pca_features', 'neighbors', 'umap'
    obsm: 'silent', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

We will then copy the two UMAP coordinate vectors into the sub_p_adata.obs DataFrame:

[23]:
sub_p_adata.obs["umap-x"] = sub_p_adata.obsm['X_umap'][:,0]
sub_p_adata.obs["umap-y"] = sub_p_adata.obsm['X_umap'][:,1]
sub_p_adata.obs.head()
[23]:
cell_number batch umap-x umap-y
F0:5 5 33D1 -3.683015 13.849557
F0:113 113 33D1 13.346261 0.064281
F0:137 137 33D1 7.532703 -3.680286
F0:375 375 33D1 9.984137 7.248507
F0:430 430 33D1 12.682547 0.622352

Step 10: Making Feature Plots

Next, we will make feature plots of each feature currently stored in the sub_p_adata.var space (not the silenced features). The InfinityFlow_Utilities.save_umap_figures_all_features function is called on the InfinityFlow AnnData object. Note, we can include the background_corrected_data to also plot the background corrected features.

This function will save the original prediction feature figures to the ‘output_umap_feature_plot_path’ and the background corrected feature figures to the ‘output_umap_bc_feature_plot_path’ in the output_paths dictionary:

[24]:
timings_6 = InfinityFlow_Utilities.save_umap_figures_all_features(\
    sub_p_adata,
    background_corrected_data = background_corrected_data,
    file_handler = file_handler,
    output_paths = output_paths,
    verbosity=3)
Working on plotting feature 33D1...
Working on plotting feature Allergin-1...
Working on plotting feature B7-H4...
Working on plotting feature CD103...
Working on plotting feature CD105...
Working on plotting feature CD106...
Working on plotting feature CD107a (Lamp-1)...
Working on plotting feature CD107b (Mac-3)...
Working on plotting feature CD115...
Working on plotting feature CD1d...
Working on plotting feature FJComp-APC-A...
Working on plotting feature FJComp-APC-eFlour780-A...
Working on plotting feature FJComp-AlexaFluor700-A...
Working on plotting feature FJComp-BUV395-A...
Working on plotting feature FJComp-BUV737-A...
Working on plotting feature FJComp-BV421-A...
Working on plotting feature FJComp-BV510-A...
Working on plotting feature FJComp-BV605-A...
Working on plotting feature FJComp-BV650-A...
Working on plotting feature FJComp-BV711-A...
Working on plotting feature FJComp-BV786-A...
Working on plotting feature FJComp-GFP-A...
Working on plotting feature FJComp-PE-Cy7(yg)-A...
Working on plotting feature FJComp-PerCP-Cy5-5-A...
../_images/notebooks_pyInfinityFlow_API_Tutorial_46_1.png

Step 11: Clustering the Data

Next, we can try to cluster the events from our FCS data. The Leiden algorithm is a popular method for clustering data, and is provided in the scanpy.tl.leiden function. It will utilize the estimated adjacency matrix produced by scanpy.pp.neighbors.

[25]:
sc.tl.leiden(sub_p_adata)
sub_p_adata
[25]:
AnnData object with n_obs × n_vars = 15000 × 24
    obs: 'cell_number', 'batch', 'umap-x', 'umap-y', 'leiden'
    var: 'name', 'USE_LOGICLE', 'LOGICLE_T', 'LOGICLE_W', 'LOGICLE_M', 'LOGICLE_A', 'LOGICLE_APPLIED', 'IMPUTED'
    uns: 'obs_file_origin', 'silent_var', 'pca', 'pca_features', 'neighbors', 'umap', 'leiden'
    obsm: 'silent', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

You can see that this function added the ‘leiden’ column to our sub_p_adata.obs DataFrame, as well as the ‘leiden’ key to the sub_p_adata.uns attribute to store the parameters provided to the Leiden clustering algorithm.

Specify Colors for Clusters

Let’s specify a set of colors to use for later plotting the clusters. The Plotting_Utilities.assign_rainbow_colors_to_groups function provides a quick way to assign colors to a set of cluster assignments:

[26]:
from pyInfinityFlow.Plotting_Utilities import assign_rainbow_colors_to_groups
groups_to_colors = assign_rainbow_colors_to_groups(\
    sub_p_adata.obs["leiden"].values)

sub_p_adata.uns['groups_to_color'] = groups_to_colors
sub_p_adata.uns['groups_to_color']
[26]:
{'0': '#8000ff',
 '1': '#6c1fff',
 '10': '#5641fd',
 '11': '#4062fa',
 '12': '#2c7ef7',
 '13': '#169bf2',
 '14': '#00b5eb',
 '15': '#14cae5',
 '16': '#2adddd',
 '17': '#40ecd4',
 '18': '#54f6cb',
 '19': '#6afdc0',
 '2': '#80ffb4',
 '20': '#94fda8',
 '21': '#abf69b',
 '22': '#c0eb8d',
 '23': '#d4dd80',
 '3': '#ebca70',
 '4': '#ffb360',
 '5': '#ff9b52',
 '6': '#ff7e41',
 '7': '#ff5f30',
 '8': '#ff4121',
 '9': '#ff1f10'}

This simply assigned a hexadecimal color value to each of the ‘leiden’ clusters and stored the mapping as a dictionary that we can use later when plotting the clusters.
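
Note that the cluster labels are strings, which is why ‘10’ sorts before ‘2’ in the dictionary above. If you want to build a similar mapping yourself in numeric cluster order, a minimal sketch using only the standard library could look like the following. The function rainbow_colors is hypothetical and is not the implementation behind assign_rainbow_colors_to_groups, so its colors will generally differ from those shown:

```python
import colorsys

def rainbow_colors(labels):
    """Map each unique integer-like label to a hex color, from violet to red."""
    groups = sorted(set(labels), key=int)  # numeric order: '2' before '10'
    n = len(groups)
    colors = {}
    for i, group in enumerate(groups):
        # Hue runs from 0.75 (violet) down to 0.0 (red)
        hue = 0.75 * (1 - i / max(n - 1, 1))
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colors[group] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255))
    return colors

# Made-up cluster labels, stored as strings like the 'leiden' column
example_labels = ['0', '1', '10', '2', '1', '0']
example_colors = rainbow_colors(example_labels)
print(example_colors)
```

Any dictionary that maps each cluster label to a color string has the same shape as the mapping stored in sub_p_adata.uns['groups_to_color'].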

Plotting Leiden clusters over UMAP

Now that we have colors associated with our Leiden clusters in the sub_p_adata.uns['groups_to_color'] attribute, we can project those clusters onto the 2D-UMAP to get an idea of where each cluster sits in the reduced dimensional space.

The plot_leiden_clusters_over_umap function will take in the InfinityFlow AnnData object and the output_paths dictionary and save this UMAP to the ‘clustering’ directory in the output_dir:

[27]:
from pyInfinityFlow.Plotting_Utilities import plot_leiden_clusters_over_umap

plot_leiden_clusters_over_umap(\
    sub_p_adata=sub_p_adata,
    output_paths=output_paths,
    verbosity=3)
../_images/notebooks_pyInfinityFlow_API_Tutorial_54_0.png
[28]:
# Clustering directory
print(output_paths['clustering'])
/media/kyle_ssd1/outputs/clustering

Step 12: Find Markers for Clusters

We can use the MarkerFinder algorithm to assign each feature to the cluster that it most specifically identifies. This is provided as the InfinityFlow_Utilities.find_markers_from_anndata function, and works directly on the InfinityFlow formatted AnnData object.

[29]:
markers_df, cell_assignments = InfinityFlow_Utilities.find_markers_from_anndata(\
    sub_p_adata=sub_p_adata,
    output_paths=output_paths,
    groups_to_colors=sub_p_adata.uns['groups_to_color'],
    verbosity=3)
Finding markers for Infinity Flow object...
Plotting markers...
../_images/notebooks_pyInfinityFlow_API_Tutorial_57_1.png

Note, this will save a heatmap of the markers vs. clusters in the output_paths['clustering'] directory, as well as a csv file with the MarkerFinder results:

[30]:
# Clustering outputs directory
print(output_paths['clustering'])
# Contents of the directory
os.listdir(output_paths['clustering'])
/media/kyle_ssd1/outputs/clustering
[30]:
['cluster_markers.csv', 'cluster_markers.pdf', 'Leiden_Clusters_over_UMAP.png']
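
The general idea behind a MarkerFinder-style assignment can be sketched as follows: each feature is correlated with an idealized on/off profile for every cluster, and the feature is assigned to the cluster with the highest correlation. This is a simplified illustration under that assumption, with hypothetical helper names and made-up values; find_markers_from_anndata may differ in its details:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def assign_markers(expression, clusters):
    """Assign each feature to the cluster whose 0/1 profile it correlates with best."""
    assignments = {}
    for feature, values in expression.items():
        scores = {}
        for cluster in sorted(set(clusters)):
            # Idealized profile: 1 inside the cluster, 0 everywhere else
            profile = [1.0 if label == cluster else 0.0 for label in clusters]
            scores[cluster] = pearson(values, profile)
        assignments[feature] = max(scores, key=scores.get)
    return assignments

# Six made-up cells in three clusters, with two made-up features
clusters = ['0', '0', '1', '1', '2', '2']
expression = {
    'marker_A': [0.9, 0.8, 0.1, 0.2, 0.1, 0.1],
    'marker_B': [0.1, 0.2, 0.1, 0.1, 0.7, 0.9],
}
assignments = assign_markers(expression, clusters)
print(assignments)
```

A feature that is high in one cluster and low elsewhere correlates strongly with that cluster's profile, which is what makes it useful as a marker.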

Step 13: Moving Features out of Silent

After we have performed dimensionality reduction and clustering with our features of interest, we may want to move the features that we previously silenced back into the AnnData object. This way, when we save our final FCS file, we can include features that we silenced earlier (e.g. ‘Time’ or ‘FSC-A’).

We can list what features are currently silenced:

[31]:
sub_p_adata.uns['silent_var'].index.values
[31]:
array(['FSC-A', 'FSC-H', 'FSC-W', 'SSC-A', 'SSC-H', 'SSC-W',
       'FJComp-PE(yg)-A', 'Isotype_rIgG2b', 'Isotype_mIgG1',
       'Isotype_AHIgG', 'Isotype_rIgG2a', 'Isotype_rIgG1', 'Time'],
      dtype=object)

Let’s move the ‘FSC-A’, ‘FSC-H’, ‘FSC-W’, ‘SSC-A’, ‘SSC-H’, and ‘SSC-W’ features out of the silenced space. The InfinityFlow_Utilities.move_features_out_of_silent function will take in our AnnData object along with a list of features to move out of silent:

[32]:
features_to_unsilence = ['FSC-A', 'FSC-H', 'FSC-W', 'SSC-A', 'SSC-H', 'SSC-W']
sub_p_adata = InfinityFlow_Utilities.move_features_out_of_silent(\
    sub_p_adata,
    features_to_unsilence)

Now these features are back in the AnnData.X and AnnData.var attributes!

[33]:
sub_p_adata.var
[33]:
name USE_LOGICLE LOGICLE_T LOGICLE_W LOGICLE_M LOGICLE_A LOGICLE_APPLIED IMPUTED
33D1 InfinityMarker_33D1 True 3000000.0 0.0 3.0 1.0 True True
Allergin-1 InfinityMarker_Allergin-1 True 3000000.0 0.0 3.0 1.0 True True
B7-H4 InfinityMarker_B7-H4 True 3000000.0 0.0 3.0 1.0 True True
CD103 InfinityMarker_CD103 True 3000000.0 0.0 3.0 1.0 True True
CD105 InfinityMarker_CD105 True 3000000.0 0.0 3.0 1.0 True True
CD106 InfinityMarker_CD106 True 3000000.0 0.0 3.0 1.0 True True
CD107a (Lamp-1) InfinityMarker_CD107a (Lamp-1) True 3000000.0 0.0 3.0 1.0 True True
CD107b (Mac-3) InfinityMarker_CD107b (Mac-3) True 3000000.0 0.0 3.0 1.0 True True
CD115 InfinityMarker_CD115 True 3000000.0 0.0 3.0 1.0 True True
CD1d InfinityMarker_CD1d True 3000000.0 0.0 3.0 1.0 True True
FJComp-APC-A CD69-CD301b True 3000000.0 0.0 3.0 1.0 True False
FJComp-APC-eFlour780-A Zombie True 3000000.0 0.0 3.0 1.0 True False
FJComp-AlexaFluor700-A MHCII True 3000000.0 0.0 3.0 1.0 True False
FJComp-BUV395-A CD4 True 3000000.0 0.0 3.0 1.0 True False
FJComp-BUV737-A CD44 True 3000000.0 0.0 3.0 1.0 True False
FJComp-BV421-A CD8 True 3000000.0 0.0 3.0 1.0 True False
FJComp-BV510-A CD11c True 3000000.0 0.0 3.0 1.0 True False
FJComp-BV605-A CD11b True 3000000.0 0.0 3.0 1.0 True False
FJComp-BV650-A F480 True 3000000.0 0.0 3.0 1.0 True False
FJComp-BV711-A Ly6C True 3000000.0 0.0 3.0 1.0 True False
FJComp-BV786-A Lineage True 3000000.0 0.0 3.0 1.0 True False
FJComp-GFP-A CD45a488 True 3000000.0 0.0 3.0 1.0 True False
FJComp-PE-Cy7(yg)-A CD24 True 3000000.0 0.0 3.0 1.0 True False
FJComp-PerCP-Cy5-5-A CD103 True 3000000.0 0.0 3.0 1.0 True False
FSC-A False 3000000.0 0.0 3.0 1.0 False False
FSC-H False 3000000.0 0.0 3.0 1.0 False False
FSC-W False 3000000.0 0.0 3.0 1.0 False False
SSC-A False 3000000.0 0.0 3.0 1.0 False False
SSC-H False 3000000.0 0.0 3.0 1.0 False False
SSC-W True 3000000.0 0.0 3.0 1.0 True False

Step 14: Saving Regression Outputs

Now that we have our final InfinityFlow object stored in AnnData format, we can save it to storage in different formats for later downstream analyses.

h5ad File

The h5ad file will preserve the structure of the AnnData object, and lets us quickly load the data for future processing with tools like Scanpy. We can simply use the .write method on the AnnData object to write the file as an h5ad file.

[34]:
h5_path = os.path.join(output_paths['output_regression_path'],
    "InfinityFlow_object_logicle_normalized.h5ad")

sub_p_adata.write(h5_path)

The output_paths['output_regression_path'] entry holds the regression output directory that was set up along with the rest of the output_paths dictionary, and is a convenient place to save the file.

[35]:
# E.g. using the output_paths dictionary
output_paths['output_regression_path']
[35]:
'/media/kyle_ssd1/outputs/regression_results'

Feather File

The Feather file format is commonly used for DataFrame objects. We will lose some of the information present in the AnnData object, but we will be able to very quickly load the DataFrame back into memory.

The InfinityFlow_Utilities.anndata_to_df function provides a quick way to convert the AnnData object to a DataFrame. Because Feather does not store a non-default index, we first reset the index and then save the DataFrame as a Feather file with the .to_feather method provided by pandas:

[36]:
# Create an output path for the DataFrame
feather_path = os.path.join(output_paths['output_regression_path'],
    "InfinityFlow_object_logicle_normalized.fea")
# Convert to DataFrame
df = InfinityFlow_Utilities.anndata_to_df(\
    input_anndata=sub_p_adata,
    use_raw_feature_names=False,
    add_index_names=True)
# Save as Feather file
df.reset_index().to_feather(feather_path)

df.head()
[36]:
33D1:InfinityMarker_33D1 Allergin-1:InfinityMarker_Allergin-1 B7-H4:InfinityMarker_B7-H4 CD103:InfinityMarker_CD103 CD105:InfinityMarker_CD105 CD106:InfinityMarker_CD106 CD107a (Lamp-1):InfinityMarker_CD107a (Lamp-1) CD107b (Mac-3):InfinityMarker_CD107b (Mac-3) CD115:InfinityMarker_CD115 CD1d:InfinityMarker_CD1d ... FJComp-BV786-A:Lineage FJComp-GFP-A:CD45a488 FJComp-PE-Cy7(yg)-A:CD24 FJComp-PerCP-Cy5-5-A:CD103 FSC-A: FSC-H: FSC-W: SSC-A: SSC-H: SSC-W:
F0:5 0.252006 0.221211 0.254082 0.249303 0.389210 0.249804 0.253386 0.300907 0.253446 0.307087 ... 0.235586 0.223836 0.272052 0.276451 44290.410156 48154.0 60277.785156 3829.130127 3626.0 0.590961
F0:113 0.245084 0.211464 0.234267 0.227056 0.261304 0.243082 0.286747 0.282273 0.246122 0.242309 ... 0.402271 0.503064 0.489275 0.321054 33078.601562 28222.0 76813.804688 9273.810547 7514.0 0.607835
F0:137 0.241113 0.245277 0.229664 0.220823 0.250333 0.235402 0.419354 0.373128 0.240887 0.404543 ... 0.416697 0.678155 0.305735 0.509343 141369.265625 105760.0 87601.898438 26223.259766 22176.0 0.603203
F0:375 0.244557 0.206941 0.245322 0.241661 0.243205 0.244426 0.255690 0.252353 0.249280 0.382239 ... 0.296814 0.598728 0.494562 0.217368 86083.023438 62675.0 90012.554688 9570.060547 7683.0 0.608832
F0:430 0.258684 0.223491 0.252599 0.246665 0.267524 0.244834 0.287726 0.294017 0.254916 0.273731 ... 0.373348 0.558255 0.484305 0.337422 126470.789062 99065.0 83666.179688 7711.190430 6665.0 0.600839

5 rows × 30 columns

FCS File

After converting the InfinityFlow object back into an FCS file, we can open the file in traditional flow cytometry analysis software (e.g. FlowJo) to perform custom downstream analyses, such as gating on certain populations.

Inverting the Logicle Normalization

However, since we used Logicle normalization to carry out regression, dimensionality reduction, and clustering more accurately, we should first invert the Logicle normalization to return to the original fluorescence intensity scale. The pyInfinityFlow format of the AnnData object stores the parameters needed to apply and invert the Logicle normalization in the .var attribute:

[37]:
sub_p_adata.var.head()
[37]:
name USE_LOGICLE LOGICLE_T LOGICLE_W LOGICLE_M LOGICLE_A LOGICLE_APPLIED IMPUTED
33D1 InfinityMarker_33D1 True 3000000.0 0.0 3.0 1.0 True True
Allergin-1 InfinityMarker_Allergin-1 True 3000000.0 0.0 3.0 1.0 True True
B7-H4 InfinityMarker_B7-H4 True 3000000.0 0.0 3.0 1.0 True True
CD103 InfinityMarker_CD103 True 3000000.0 0.0 3.0 1.0 True True
CD105 InfinityMarker_CD105 True 3000000.0 0.0 3.0 1.0 True True

We can invert the Logicle normalization on the features that have the USE_LOGICLE column set to True by using the InfinityFlow_Utilities.apply_inverse_logicle_to_anndata function:

[38]:
InfinityFlow_Utilities.apply_inverse_logicle_to_anndata(sub_p_adata)
InfinityFlow_Utilities.anndata_to_df(sub_p_adata).head()
[38]:
33D1 Allergin-1 B7-H4 CD103 CD105 CD106 CD107a (Lamp-1) CD107b (Mac-3) CD115 CD1d ... FJComp-BV786-A FJComp-GFP-A FJComp-PE-Cy7(yg)-A FJComp-PerCP-Cy5-5-A FSC-A FSC-H FSC-W SSC-A SSC-H SSC-W
F0:5 110.998680 -1611.187378 225.875824 -38.550339 9981.195312 -10.849790 187.352631 2918.396240 190.660049 3302.118652 ... -799.727173 -1461.367188 1228.294067 1477.697266 44290.410156 48154.0 60277.785156 3829.130127 3626.0 69207.367188
F0:113 -272.058624 -2176.202148 -873.436096 -1278.713989 626.500122 -382.992645 2071.328857 1811.447266 -214.581619 -425.844513 ... 11457.474609 30567.076172 26846.810547 4209.315430 33078.601562 28222.0 76813.804688 9273.810547 7514.0 80884.835938
F0:137 -492.217468 -261.397980 -1131.551514 -1633.387573 18.449854 -810.031494 13642.258789 8355.895508 -504.758484 11730.511719 ... 13281.506836 154731.609375 3217.331787 32420.519531 141369.265625 105760.0 87601.898438 26223.259766 22176.0 77496.765625
F0:375 -301.269897 -2443.946289 -258.863922 -461.783051 -376.178040 -308.487518 314.955200 130.189346 -39.852901 9252.339844 ... 2669.240967 74358.414062 28219.044922 -1832.219971 86083.023438 62675.0 90012.554688 9570.060547 7683.0 81632.640625
F0:430 480.960388 -1480.978882 143.819992 -184.526932 973.659363 -285.889282 2128.691162 2501.141357 272.047882 1323.218140 ... 8376.817383 51127.210938 25615.402344 5362.881836 126470.789062 99065.0 83666.179688 7711.190430 6665.0 75823.031250

5 rows × 30 columns

You can see that the fluorescence derived values are now no longer between 0 and 1, indicating that the Logicle normalization has been inverted.
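
For readers curious about what the inversion does numerically, here is a minimal sketch of the inverse Logicle function for the special case W = 0, using the parameter values stored in .var (T = 3000000, M = 3, A = 1). It assumes the standard Logicle parameterization, in which the scaled value A / (M + A) maps to 0 and the scaled value 1 maps to T; the helper inverse_logicle_w0 is hypothetical and is not the library's implementation:

```python
import math

def inverse_logicle_w0(y, T=3_000_000.0, M=3.0, A=1.0):
    """Inverse Logicle transform for W = 0 (assumed standard parameterization)."""
    b = (M + A) * math.log(10.0)
    x1 = A / (M + A)  # scaled position that maps back to an intensity of 0
    # Choose the amplitude so that a scaled value of 1 maps back to T
    a = T / (math.exp(b) - math.exp(2.0 * b * x1 - b))
    # With W = 0 the biexponential reduces to a curve that is odd-symmetric around x1
    return a * (math.exp(b * y) - math.exp(2.0 * b * x1) * math.exp(-b * y))

# Scaled values taken from the Logicle-normalized table shown earlier
for scaled in (0.252006, 0.221211, 0.590961):
    print(scaled, inverse_logicle_w0(scaled))
```

Under these assumptions, 0.252006 comes out near 111 (against 110.998680 for 33D1 in the table above) and 0.590961 comes out near 69207 (against 69207.367188 for SSC-W). Small differences are expected from display rounding and from implementation details of the library.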

We can then save the data as an FCS file with the InfinityFlow_Utilities.save_fcs_flow_anndata function:

[39]:
InfinityFlow_Utilities.save_fcs_flow_anndata(\
    sub_p_adata = sub_p_adata,
    background_corrected_data = background_corrected_data,
    background_corrected_var = background_corrected_var,
    file_handler = file_handler,
    output_paths = output_paths,
    add_umap = True,
    use_logicle = True,
    verbosity=3)
Writing out base prediction values to fcs file...
WARNING! No features required inverting logicle normalization at this time.
Omitting spillover matrix...
WARNING! TEXT segment value for key $P25S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P26S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P27S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P28S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P29S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P30S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P31S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P32S is empty. Excluding from written file.
Writing out background-corrected prediction values to fcs file...
Omitting spillover matrix...
WARNING! TEXT segment value for key $P25S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P26S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P27S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P28S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P29S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P30S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P31S is empty. Excluding from written file.
WARNING! TEXT segment value for key $P32S is empty. Excluding from written file.
[39]:
{'file_export': 2.1501059532165527}

Finish

We have now carried out all of the steps of the analysis pipeline provided by pyInfinityFlow, using the API!