Compare commits
14 Commits
7558455f39
...
v1.0
| Author | SHA1 | Date | |
|---|---|---|---|
| 4b053e6ec4 | |||
| 44784b7976 | |||
| 7c19896dc9 | |||
| aec7e3016f | |||
| 5c75c1ac09 | |||
| cb1f7adece | |||
| 370de79546 | |||
| a803336f56 | |||
| 94b54b3416 | |||
| 601e141fd0 | |||
| 8f9c6b7d33 | |||
| e5ddc73723 | |||
| 9b18fac74f | |||
| 63ef6aa7a0 |
1
.gitignore
vendored
Normal file
1
.gitignore
vendored
Normal file
@@ -0,0 +1 @@
|
||||
/out/
|
||||
Binary file not shown.
124
readme.md
124
readme.md
@@ -4,7 +4,7 @@
|
||||
## ABOUT
|
||||
|
||||
This program simulates BiGpairSEQ (Bipartite Graph pairSEQ), a graph theory-based adaptation
|
||||
of the pairSEQ algorithm (Howie et al. 2015) for pairing T cell receptor sequences.
|
||||
of the pairSEQ algorithm (Howie, et al. 2015) for pairing T cell receptor sequences.
|
||||
|
||||
## THEORY
|
||||
|
||||
@@ -15,19 +15,18 @@ directly.
|
||||
BiGpairSEQ creates a [simple bipartite weighted graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
|
||||
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
|
||||
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
|
||||
(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
|
||||
(Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
|
||||
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
|
||||
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
||||
|
||||
This is a well-studied combinatorial optimization problem, with many known solutions.
|
||||
The best currently-known algorithm for bipartite graphs with integer weights--which is what BiGpairSEQ uses--
|
||||
The most efficient algorithm known to the author for maximum weight matching of a bipartite graph with strictly integral weights
|
||||
is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N,
|
||||
their algorithm runs in **O(m sqrt(n) log(N))** time. This is the best known efficiency for finding a maximum weight
|
||||
matching on a bipartite graph, and the integer edge weight requirement makes it ideal for BiGpairSEQ.
|
||||
their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of a pairSEQ experiment is
|
||||
bipartite with integer weights, this algorithm is ideal for BiGpairSEQ.
|
||||
|
||||
Unfortunately, it's a fairly new algorithm, and the integer edge weight requirement makes it less generically useful.
|
||||
It is not implemented by the graph theory library used in this simulator. So this program
|
||||
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
|
||||
Unfortunately, it's a fairly new algorithm, and not yet implemented by the graph theory library used in this simulator.
|
||||
So this program instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
|
||||
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999).
|
||||
|
||||
The current version of the program uses a pairing heap instead of a Fibonacci heap for its priority queue,
|
||||
@@ -56,7 +55,7 @@ main menu looks like this:
|
||||
|
||||
```
|
||||
--------BiGPairSEQ SIMULATOR--------
|
||||
ALPHA/BETA T-CELL RECEPTOR MATCHING
|
||||
ALPHA/BETA T CELL RECEPTOR MATCHING
|
||||
USING WEIGHTED BIPARTITE GRAPHS
|
||||
------------------------------------
|
||||
Please select an option:
|
||||
@@ -81,9 +80,9 @@ writing files, the program will automatically add the correct extension to any f
|
||||
|
||||
#### Cell Sample Files
|
||||
Cell Sample files consist of any number of distinct "T cells." Every cell contains
|
||||
four sequences: Alpha CDR3, Beta CDR, Alpha CDR1, Beta CDR1. The sequences are represented by
|
||||
random integers. CDR3 Alpha and Beta sequences are all unique. CDR1 Alpha and Beta sequences
|
||||
are not necessarily unique; the relative diversity can be set when making a Cell Sample file.
|
||||
four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
|
||||
random integers. CDR3 Alpha and Beta sequences are all unique within a given Cell Sample file. CDR1 Alpha and Beta sequences
|
||||
are not necessarily unique; the relative diversity can be set when making the file.
|
||||
|
||||
(Note: though cells still have CDR1 sequences, matching of CDR1s is currently awaiting re-implementation.)
|
||||
|
||||
@@ -94,7 +93,7 @@ Options when making a Cell Sample file:
|
||||
Files are in CSV format. Rows are distinct T cells, columns are sequences within the cells.
|
||||
Comments are preceded by `#`
|
||||
|
||||
Structure example:
|
||||
Structure:
|
||||
|
||||
---
|
||||
# Sample contains 1 unique CDR1 for every 4 unique CDR3s.
|
||||
@@ -108,9 +107,9 @@ Structure example:
|
||||
Sample Plate files consist of any number of "wells" containing any number of T cells (as
|
||||
described above). The wells are filled randomly from a Cell Sample file, according to a selected
|
||||
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
||||
given dropout probability, be omitted from the file. This simulates the effect of amplification errors
|
||||
prior to sequencing. Plates can also be partitioned into any number of (approximately) evenly-sized
|
||||
sections, each of which can have a different number of T cells per well.
|
||||
given dropout probability, be omitted from the file; this simulates the effect of amplification errors
|
||||
prior to sequencing. Plates can also be partitioned into any number of sections, each of which can have a
|
||||
different concentration of T cells per well.
|
||||
|
||||
Options when making a Sample Plate file:
|
||||
* Cell Sample file to use
|
||||
@@ -120,7 +119,7 @@ Options when making a Sample Plate file:
|
||||
* Standard deviation size
|
||||
* Exponential
|
||||
* Lambda value
|
||||
* Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie et al. 2015)
|
||||
* (Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015))
|
||||
* Total number of wells on the plate
|
||||
* Number of sections on plate
|
||||
* Number of T cells per well
|
||||
@@ -128,42 +127,42 @@ Options when making a Sample Plate file:
|
||||
* Dropout rate
|
||||
|
||||
Files are in CSV format. There are no header labels. Every row represents a well.
|
||||
Every column represents an individual cell, containing four sequences, represented by an array string:
|
||||
Every column represents an individual cell, containing four sequences, depicted as an array string:
|
||||
`[CDR3A, CDR3B, CDR1A, CDR1B]`. So a representative cell might look like this:
|
||||
|
||||
`[525902, 791533, -1, 866282]`
|
||||
|
||||
Notice that the Alpha CDR1 is missing in the cell above, due to sequence dropout.
|
||||
Dropouts are represented by replacing sequences with the value `-1`. Comments are preceded by `#`
|
||||
Notice that the CDR1 Alpha is missing in the cell above--sequence dropout from simulated amplification error.
|
||||
Dropout sequences are replaced with the value `-1`. Comments are preceded by `#`
|
||||
|
||||
Structure Example:
|
||||
Structure:
|
||||
|
||||
---
|
||||
```
|
||||
# Cell source file name: 4MilCells.csv
|
||||
# Plate size: 96
|
||||
# Error rate: 0.1
|
||||
# Concentrations: 10000 5000 500
|
||||
# Lambda: 0.6
|
||||
# Cell source file name:
|
||||
# Each row represents one well on the plate
|
||||
# Plate size:
|
||||
# Concentrations:
|
||||
# Lambda -or- StdDev:
|
||||
```
|
||||
| well 1 | well 2 | well 3| ... |
|
||||
| Well 1, cell 1 | Well 1, cell 2 | Well 1, cell 3| ... |
|
||||
|---|---|---|---|
|
||||
| [105383, 786528, 959247, 925928] | [525902, 791533, -1, 866282] | [409236, 132303, 804465, 942261]| ... |
|
||||
| [249930, 301502, 970003, 881099] | [523787, 552952, 997194, 970507]| [425363, 417411, 845399, -1]| ... |
|
||||
| ... | ... | ... | ... |
|
||||
| **Well 2, cell 1** | **Well 2, cell 2** | **Well 2, cell 3**| **...** |
|
||||
| **Well 3, cell 1** | **Well 3, cell 2** | **Well 3, cell 3**| **...** |
|
||||
| **...** | **...** | **...** | **...** |
|
||||
|
||||
---
|
||||
|
||||
#### Graph and Data Files
|
||||
Graph and Data files are serialized binaries of a Java object containing the graph representation of a
|
||||
Sample Plate and necessary metadata for matching and results output. Making them requires a Cell Sample file (to construct a list of correct sequence pairs for checking
|
||||
the accuracy of BiGpairSEQ simulations) and a Sample Plate file (to construct the associated
|
||||
occupancy graph). These files can be several gigabytes in size. Writing them to a file lets us generate a graph and
|
||||
its metadata once, then use it for multiple different BiGpairSEQ simulations.
|
||||
Graph and Data files are serialized binaries of a Java object containing the weigthed bipartite graph representation of a
|
||||
Sample Plate, along with the necessary metadata for matching and results output. Making them requires a Cell Sample file
|
||||
(to construct a list of correct sequence pairs for checking the accuracy of BiGpairSEQ simulations) and a
|
||||
Sample Plate file (to construct the associated occupancy graph). These files can be several gigabytes in size.
|
||||
Writing them to a file lets us generate a graph and its metadata once, then use it for multiple different BiGpairSEQ simulations.
|
||||
|
||||
Options for creating a Graph and Data file:
|
||||
* The Cell Sample file to use
|
||||
* The Sample Plate file (generated from the given Cell Sample file) to use.
|
||||
* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
|
||||
|
||||
These files do not have a human-readable structure, and are not portable to other programs. (Export of graphs in a
|
||||
portable data format may be implemented in the future. The tricky part is encoding the necessary metadata.)
|
||||
@@ -181,10 +180,9 @@ Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
|
||||
* The maximum number of alpha/beta overlap wells to attempt to match
|
||||
* (must be <= the number of wells on the plate - 1)
|
||||
* The maximum difference in alpha/beta occupancy to attempt to match
|
||||
* (To skip using this filter, enter a value >= the number of wells on the plate)
|
||||
* The minimum percentage of a sequence's occupied wells shared by another sequence to attempt to match
|
||||
* given value from 0 to 100
|
||||
* (To skip using this filter, enter 0)
|
||||
* (Optional. To skip using this filter, enter a value >= the number of wells on the plate)
|
||||
* The minimum overlap percentage--the percentage of a sequence's occupied wells shared by another sequence--to attempt to match. Given as value in range 0 - 100.
|
||||
* (Optional. To skip using this filter, enter 0)
|
||||
|
||||
Example output:
|
||||
|
||||
@@ -217,34 +215,54 @@ Example output:
|
||||
---
|
||||
|
||||
**NOTE: The p-values in the output are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all.
|
||||
P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison,
|
||||
P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison only,
|
||||
using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015)
|
||||
|
||||
### PERFORMANCE
|
||||
Performance details of the example excerpted above:
|
||||
|
||||
On a home computer with a Ryzen 5600X CPU, 64GB of 3200MHz DDR4 RAM (half of which was allocated to the Java Virtual Machine), and a PCIe 3.0 SSD, running Linux Mint 20.3 Edge (5.13 kernel),
|
||||
the author ran a BiGpairSEQ simulation of a 96-well sample plate with 30,000 T cells/well comprising ~11,800 alphas and betas,
|
||||
taken from a sample of 4,000,000 distinct cells with an exponential frequency distribution.
|
||||
|
||||
With min/max occupancy threshold of 3 and 94 wells for matching, and no other pre-filtering, BiGpairSEQ identified 5,151
|
||||
correct pairings and 18 incorrect pairings, for an accuracy of 99.652%.
|
||||
|
||||
The simulation time was 14'22". If intermediate results were held in memory, this would be equivalent to the total elapsed time.
|
||||
|
||||
Since this implementation of BiGpairSEQ writes intermediate results to disk (to improve the efficiency of *repeated* simulations
|
||||
with different filtering options), the actual elapsed time was greater. File I/O time was not measured, but took
|
||||
slightly less time than the simulation itself. Real elapsed time from start to finish was under 30 minutes.
|
||||
|
||||
## TODO
|
||||
|
||||
* ~~Try invoking GC at end of workloads to reduce paging to disk~~ DONE
|
||||
* ~~Hold graph data in memory until another graph is read-in?~~
|
||||
* No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.
|
||||
* ~~See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows~~ DONE
|
||||
* ~~Hold graph data in memory until another graph is read-in?~~ ABANDONED
|
||||
* *No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.*
|
||||
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows.
|
||||
* ~~Problem is variable number of cells in a well~~
|
||||
* ~~Apache Commons CSV library writes entries a row at a time~~
|
||||
* _Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily Thingiverse transposes CSV matrices in R, might not even be worth fixing._
|
||||
* Re-implement command line arguments, to enable scripting and statistical simulation studies
|
||||
* Implement sample plates with random numbers of T cells per well.
|
||||
* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well population sizes on a sample plate; pairSEQ is not.
|
||||
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
|
||||
* Enable GraphML output in addition to serialized object binaries, for data portability
|
||||
* Custom vertex type with attribute for sequence occupancy?
|
||||
* Re-implement CDR1 matching method
|
||||
* Re-implement command line arguments, to enable scripting and statistical simulation studies
|
||||
* Implement Duan and Su's maximum weight matching algorithms
|
||||
* Implement Duan and Su's maximum weight matching algorithm
|
||||
* Add controllable algorithm-type parameter?
|
||||
* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for current matching algorithm
|
||||
* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for priority queue in current matching algorithm
|
||||
* in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
|
||||
* Add controllable heap-type parameter?
|
||||
* Implement sample plates with random numbers of T cells per well
|
||||
* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well populations; pairSEQ is not.
|
||||
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
|
||||
|
||||
|
||||
|
||||
## CITATIONS
|
||||
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)
|
||||
* Duan, R., Su H. ["A Scaling Algorithm for Maximum Weight Matching in Bipartite Graphs."](https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf) Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, p. 1413-1424. (2012)
|
||||
* K. Melhorn, St. Näher. [The LEDA Platform of Combinatorial and Geometric Computing.](https://people.mpi-inf.mpg.de/~mehlhorn/LEDAbook.html) Cambridge University Press. Chapter 7, Graph Algorithms; p. 132-162 (1999)
|
||||
* M. Fredman, R. Tarjan. ["Fibonacci heaps and their uses in improved network optimization algorithms."](https://www.cl.cam.ac.uk/teaching/1011/AlgorithII/1987-FredmanTar-fibonacci.pdf) J. ACM, 34(3):596–615 (1987))
|
||||
* Melhorn, K., Näher, St. [The LEDA Platform of Combinatorial and Geometric Computing.](https://people.mpi-inf.mpg.de/~mehlhorn/LEDAbook.html) Cambridge University Press. Chapter 7, Graph Algorithms; p. 132-162 (1999)
|
||||
* Fredman, M., Tarjan, R. ["Fibonacci heaps and their uses in improved network optimization algorithms."](https://www.cl.cam.ac.uk/teaching/1011/AlgorithII/1987-FredmanTar-fibonacci.pdf) J. ACM, 34(3):596–615 (1987))
|
||||
|
||||
## EXTERNAL LIBRARIES USED
|
||||
* [JGraphT](https://jgrapht.org) -- Graph theory data structures and algorithms
|
||||
@@ -257,4 +275,4 @@ BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought t
|
||||
pairSEQ paper to the author's attention and explained all the biology terms he didn't know.
|
||||
|
||||
## AUTHOR
|
||||
Eugene Fischer, 2021. UI improvements and documentation, 2022.
|
||||
BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. UI improvements and documentation, 2022.
|
||||
@@ -8,6 +8,9 @@ public abstract class Equations {
|
||||
return (int) ((Math.random() * (max - min)) + min);
|
||||
}
|
||||
|
||||
//pValue calculation as described in original pairSEQ paper.
|
||||
//Included for comparison with original results.
|
||||
//Not used by BiGpairSEQ for matching.
|
||||
public static double pValue(Integer w, Integer w_a, Integer w_b, double w_ab_d) {
|
||||
int w_ab = (int) w_ab_d;
|
||||
double pv = 0.0;
|
||||
@@ -18,6 +21,9 @@ public abstract class Equations {
|
||||
return pv;
|
||||
}
|
||||
|
||||
//Implementation of the (corrected) probability equation from pairSEQ paper.
|
||||
//Included for comparison with original results.
|
||||
//Not used by BiGpairSEQ for matching.
|
||||
private static double probPairedByChance(Integer w, Integer w_a, Integer w_b, Integer w_ab){
|
||||
BigInteger numer1 = choose(w, w_ab);
|
||||
BigInteger numer2 = choose(w - w_ab, w_a - w_ab);
|
||||
@@ -30,10 +36,9 @@ public abstract class Equations {
|
||||
return prob.doubleValue();
|
||||
}
|
||||
|
||||
/*
|
||||
* This works because nC(k+1) = nCk * (n-k)/(k+1)
|
||||
* Since nC0 = 1, can start there and generate all the rest.
|
||||
*/
|
||||
|
||||
//This works because nC(k+1) = nCk * (n-k)/(k+1)
|
||||
//Since nC0 = 1, can start there and generate all the rest.
|
||||
public static BigInteger choose(final int N, final int K) {
|
||||
BigInteger nCk = BigInteger.ONE;
|
||||
for (int k = 0; k < K; k++) {
|
||||
|
||||
@@ -68,7 +68,7 @@ public class Plate {
|
||||
}
|
||||
Integer[] cellToAdd = cells.get(n).clone();
|
||||
for(int k = 0; k < cellToAdd.length; k++){
|
||||
if(Math.abs(rand.nextDouble()) < error){//error applied to each peptide
|
||||
if(Math.abs(rand.nextDouble()) < error){//error applied to each seqeunce
|
||||
cellToAdd[k] = -1;
|
||||
}
|
||||
}
|
||||
@@ -98,7 +98,7 @@ public class Plate {
|
||||
n = (int) Math.floor(m);
|
||||
Integer[] cellToAdd = cells.get(n).clone();
|
||||
for(int k = 0; k < cellToAdd.length; k++){
|
||||
if(Math.abs(rand.nextDouble()) < error){//error applied to each peptide
|
||||
if(Math.abs(rand.nextDouble()) < error){//error applied to each sequence
|
||||
cellToAdd[k] = -1;
|
||||
}
|
||||
}
|
||||
@@ -134,30 +134,30 @@ public class Plate {
|
||||
return wells;
|
||||
}
|
||||
|
||||
//returns a map of the counts of the peptide at cell index pIndex, in all wells
|
||||
public Map<Integer, Integer> assayWellsPeptideP(int... pIndices){
|
||||
return this.assayWellsPeptideP(0, size, pIndices);
|
||||
//returns a map of the counts of the sequence at cell index sIndex, in all wells
|
||||
public Map<Integer, Integer> assayWellsSequenceS(int... sIndices){
|
||||
return this.assayWellsSequenceS(0, size, sIndices);
|
||||
}
|
||||
|
||||
//returns a map of the counts of the peptide at cell index pIndex, in a specific well
|
||||
public Map<Integer, Integer> assayWellsPeptideP(int n, int... pIndices) { return this.assayWellsPeptideP(n, n+1, pIndices);}
|
||||
//returns a map of the counts of the sequence at cell index sIndex, in a specific well
|
||||
public Map<Integer, Integer> assayWellsSequenceS(int n, int... sIndices) { return this.assayWellsSequenceS(n, n+1, sIndices);}
|
||||
|
||||
//returns a map of the counts of the peptide at cell index pIndex, in a range of wells
|
||||
public Map<Integer, Integer> assayWellsPeptideP(int start, int end, int... pIndices) {
|
||||
//returns a map of the counts of the sequence at cell index sIndex, in a range of wells
|
||||
public Map<Integer, Integer> assayWellsSequenceS(int start, int end, int... sIndices) {
|
||||
Map<Integer,Integer> assay = new HashMap<>();
|
||||
for(int pIndex: pIndices){
|
||||
for(int pIndex: sIndices){
|
||||
for(int i = start; i < end; i++){
|
||||
countPeptides(assay, wells.get(i), pIndex);
|
||||
countSequences(assay, wells.get(i), pIndex);
|
||||
}
|
||||
}
|
||||
return assay;
|
||||
}
|
||||
//For the peptides at cell indices pIndices, counts number of unique peptides in the given well into the given map
|
||||
private void countPeptides(Map<Integer, Integer> wellMap, List<Integer[]> well, int... pIndices) {
|
||||
//For the sequences at cell indices sIndices, counts number of unique sequences in the given well into the given map
|
||||
private void countSequences(Map<Integer, Integer> wellMap, List<Integer[]> well, int... sIndices) {
|
||||
for(Integer[] cell : well) {
|
||||
for(int pIndex: pIndices){
|
||||
if(cell[pIndex] != -1){
|
||||
wellMap.merge(cell[pIndex], 1, (oldValue, newValue) -> oldValue + newValue);
|
||||
for(int sIndex: sIndices){
|
||||
if(cell[sIndex] != -1){
|
||||
wellMap.merge(cell[sIndex], 1, (oldValue, newValue) -> oldValue + newValue);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -31,54 +31,23 @@ public class PlateFileReader {
|
||||
BufferedReader reader = Files.newBufferedReader(Path.of(filename));
|
||||
CSVParser parser = new CSVParser(reader, plateFileFormat);
|
||||
){
|
||||
//old code for wells as rows
|
||||
// for(CSVRecord record: parser.getRecords()) {
|
||||
// List<Integer[]> well = new ArrayList<>();
|
||||
// for(String s: record) {
|
||||
// if(!"".equals(s)) {
|
||||
// String[] intString = s.replaceAll("\\[", "")
|
||||
// .replaceAll("]", "")
|
||||
// .replaceAll(" ", "")
|
||||
// .split(",");
|
||||
// //System.out.println(intString);
|
||||
// Integer[] arr = new Integer[intString.length];
|
||||
// for (int i = 0; i < intString.length; i++) {
|
||||
// arr[i] = Integer.valueOf(intString[i]);
|
||||
// }
|
||||
// well.add(arr);
|
||||
// }
|
||||
// }
|
||||
// wells.add(well);
|
||||
for(CSVRecord record: parser.getRecords()) {
|
||||
if (wells.size() == 0) {
|
||||
int num = 0;
|
||||
for (String s: record) {
|
||||
num++;
|
||||
}
|
||||
for (int i = 0; i < num; i++) {
|
||||
wells.add(new ArrayList<>());
|
||||
}
|
||||
} else {
|
||||
int i = 0;
|
||||
for (String s : record) {
|
||||
if (!"".equals(s)) { //if value isn't the empty string
|
||||
//get rid of brackets, split at commas into a string array
|
||||
String[] intsAsStrings = s.replaceAll("\\[", "")
|
||||
List<Integer[]> well = new ArrayList<>();
|
||||
for(String s: record) {
|
||||
if(!"".equals(s)) {
|
||||
String[] intString = s.replaceAll("\\[", "")
|
||||
.replaceAll("]", "")
|
||||
.replaceAll(" ", "")
|
||||
.split(",");
|
||||
//Make Integer array with the same values
|
||||
Integer[] arr = new Integer[intsAsStrings.length];
|
||||
for (int j = 0; j < intsAsStrings.length; j++) {
|
||||
arr[j] = Integer.valueOf(intsAsStrings[j]);
|
||||
//System.out.println(intString);
|
||||
Integer[] arr = new Integer[intString.length];
|
||||
for (int i = 0; i < intString.length; i++) {
|
||||
arr[i] = Integer.valueOf(intString[i]);
|
||||
}
|
||||
//Add Integer array to the correct well
|
||||
wells.get(i).add(arr);
|
||||
i++;
|
||||
well.add(arr);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
wells.add(well);
|
||||
}
|
||||
} catch(IOException ex){
|
||||
System.out.println("plate file " + filename + " not found.");
|
||||
|
||||
@@ -59,7 +59,7 @@ public class PlateFileWriter {
|
||||
}
|
||||
}
|
||||
|
||||
//this took forever, and I don't use it, because it makes reading data in a huge pain
|
||||
//this took forever
|
||||
List<List<String>> rows = new ArrayList<>();
|
||||
List<String> tmp = new ArrayList<>();
|
||||
for(int i = 0; i < wellsAsStrings.size(); i++){//List<Integer[]> w: wells){
|
||||
@@ -89,6 +89,7 @@ public class PlateFileWriter {
|
||||
CSVPrinter printer = new CSVPrinter(writer, plateFileFormat);
|
||||
){
|
||||
printer.printComment("Cell source file name: " + sourceFileName);
|
||||
printer.printComment("Each row represents one well on the plate.");
|
||||
printer.printComment("Plate size: " + size);
|
||||
printer.printComment("Error rate: " + error);
|
||||
printer.printComment("Concentrations: " + concenString);
|
||||
@@ -98,7 +99,7 @@ public class PlateFileWriter {
|
||||
else {
|
||||
printer.printComment("Std. dev.: " + stdDev);
|
||||
}
|
||||
printer.printRecords(rows);
|
||||
printer.printRecords(wellsAsStrings);
|
||||
} catch(IOException ex){
|
||||
System.out.println("Could not make new file named "+filename);
|
||||
System.err.println(ex);
|
||||
|
||||
@@ -57,8 +57,8 @@ public class Simulator {
|
||||
if(verbose){System.out.println("Cell maps made");}
|
||||
|
||||
if(verbose){System.out.println("Making well maps");}
|
||||
Map<Integer, Integer> allAlphas = samplePlate.assayWellsPeptideP(alphaIndex);
|
||||
Map<Integer, Integer> allBetas = samplePlate.assayWellsPeptideP(betaIndex);
|
||||
Map<Integer, Integer> allAlphas = samplePlate.assayWellsSequenceS(alphaIndex);
|
||||
Map<Integer, Integer> allBetas = samplePlate.assayWellsSequenceS(betaIndex);
|
||||
int alphaCount = allAlphas.size();
|
||||
if(verbose){System.out.println("All alphas count: " + alphaCount);}
|
||||
int betaCount = allBetas.size();
|
||||
@@ -296,8 +296,8 @@ public class Simulator {
|
||||
System.out.println("Cell maps made");
|
||||
|
||||
System.out.println("Making well maps");
|
||||
Map<Integer, Integer> allCDR3s = samplePlate.assayWellsPeptideP(cdr3Indices);
|
||||
Map<Integer, Integer> allCDR1s = samplePlate.assayWellsPeptideP(cdr1Indices);
|
||||
Map<Integer, Integer> allCDR3s = samplePlate.assayWellsSequenceS(cdr3Indices);
|
||||
Map<Integer, Integer> allCDR1s = samplePlate.assayWellsSequenceS(cdr1Indices);
|
||||
int CDR3Count = allCDR3s.size();
|
||||
System.out.println("all CDR3s count: " + CDR3Count);
|
||||
int CDR1Count = allCDR1s.size();
|
||||
@@ -591,26 +591,26 @@ public class Simulator {
|
||||
Map<Integer, Integer> rowSequenceCounts,
|
||||
Map<Integer,Integer> columnSequenceCounts,
|
||||
double[][] weights){
|
||||
Map<Integer, Integer> wellNRowPeptides = null;
|
||||
Map<Integer, Integer> wellNColumnPeptides = null;
|
||||
Map<Integer, Integer> wellNRowSequences = null;
|
||||
Map<Integer, Integer> wellNColumnSequences = null;
|
||||
int vertexStartValue = rowSequenceToVertexMap.size();
|
||||
int numWells = samplePlate.getSize();
|
||||
for (int n = 0; n < numWells; n++) {
|
||||
wellNRowPeptides = samplePlate.assayWellsPeptideP(n, rowSequenceIndices);
|
||||
for (Integer a : wellNRowPeptides.keySet()) {
|
||||
wellNRowSequences = samplePlate.assayWellsSequenceS(n, rowSequenceIndices);
|
||||
for (Integer a : wellNRowSequences.keySet()) {
|
||||
if(allRowSequences.containsKey(a)){
|
||||
rowSequenceCounts.merge(a, 1, (oldValue, newValue) -> oldValue + newValue);
|
||||
}
|
||||
}
|
||||
wellNColumnPeptides = samplePlate.assayWellsPeptideP(n, colSequenceIndices);
|
||||
for (Integer b : wellNColumnPeptides.keySet()) {
|
||||
wellNColumnSequences = samplePlate.assayWellsSequenceS(n, colSequenceIndices);
|
||||
for (Integer b : wellNColumnSequences.keySet()) {
|
||||
if(allColumnSequences.containsKey(b)){
|
||||
columnSequenceCounts.merge(b, 1, (oldValue, newValue) -> oldValue + newValue);
|
||||
}
|
||||
}
|
||||
for (Integer i : wellNRowPeptides.keySet()) {
|
||||
for (Integer i : wellNRowSequences.keySet()) {
|
||||
if(allRowSequences.containsKey(i)){
|
||||
for (Integer j : wellNColumnPeptides.keySet()) {
|
||||
for (Integer j : wellNColumnSequences.keySet()) {
|
||||
if(allColumnSequences.containsKey(j)){
|
||||
weights[rowSequenceToVertexMap.get(i)][columnSequenceToVertexMap.get(j) - vertexStartValue] += 1.0;
|
||||
}
|
||||
|
||||
@@ -258,7 +258,7 @@ public class UserInterface {
|
||||
while (!quit) {
|
||||
System.out.println();
|
||||
System.out.println("--------BiGPairSEQ SIMULATOR--------");
|
||||
System.out.println("ALPHA/BETA T-CELL RECEPTOR MATCHING");
|
||||
System.out.println("ALPHA/BETA T CELL RECEPTOR MATCHING");
|
||||
System.out.println(" USING WEIGHTED BIPARTITE GRAPHS ");
|
||||
System.out.println("------------------------------------");
|
||||
System.out.println("Please select an option:");
|
||||
@@ -681,18 +681,8 @@ public class UserInterface {
|
||||
System.out.println("This program simulates BiGpairSEQ, a graph theory based adaptation");
|
||||
System.out.println("of the pairSEQ algorithm for pairing T cell receptor sequences.");
|
||||
System.out.println();
|
||||
System.out.println("Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares");
|
||||
System.out.println("against a null distribution, BiGpairSEQ does not do any statistical calculations");
|
||||
System.out.println("directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing");
|
||||
System.out.println("the sample plate. The distinct TCRA and TCRB sequences form the two sets of vertices.");
|
||||
System.out.println("Every TCRA/TCRB pair that share a well are connected by an edge, with the edge weight");
|
||||
System.out.println("set to the number of wells in which both sequences appear. (Sequences in all wells are");
|
||||
System.out.println("filtered out prior to creating the graph, as there is no signal in their occupancy");
|
||||
System.out.println("pattern.) The problem of pairing TCRA/TCRB sequences thus reduces to the \"assignment");
|
||||
System.out.println("problem\" of finding a maximum weight matching on a bipartite graph--the subset of");
|
||||
System.out.println("vertex-disjoint edges whose weights sum to the maximum possible value.");
|
||||
System.out.println();
|
||||
System.out.println("For full documentation, see: https://gitea.ejsf.synology.me/efischer/BiGpairSEQ");
|
||||
System.out.println("For full documentation, view readme.md file distributed with this code");
|
||||
System.out.println("or visit https://gitea.ejsf.synology.me/efischer/BiGpairSEQ.");
|
||||
System.out.println();
|
||||
System.out.println("pairSEQ citation:");
|
||||
System.out.println("Howie, B., Sherwood, A. M., et. al.");
|
||||
|
||||
Reference in New Issue
Block a user