8 Commits

10 changed files with 171 additions and 107 deletions

View File

@@ -66,6 +66,18 @@ Please select an option:
0) Exit
```
By default, the Options menu looks like this:
```
--------------OPTIONS---------------
1) Turn on cell sample file caching
2) Turn on plate file caching
3) Turn on graph/data file caching
4) Turn off serialized binary graph output
5) Turn on GraphML graph output
6) Maximum weight matching algorithm options
0) Return to main menu
```
### INPUT/OUTPUT
To run the simulation, the program reads and writes 4 kinds of files:
@@ -75,21 +87,24 @@ To run the simulation, the program reads and writes 4 kinds of files:
* Matching Results files in CSV format
These files are often generated in sequence. When entering filenames, it is not necessary to include the file extension
(.csv or .ser). When reading or writing files, the program will automatically add the correct extension to any filename without one.
(.csv or .ser). When reading or writing files, the program will automatically add the correct extension to any filename
without one.
To save file I/O time, the most recent instance of each of these four
files either generated or read from disk can be cached in program memory. This is could be important for Graph/Data files,
which can be several gigabytes in size. Since some simulations may require running multiple,
differently-configured BiGpairSEQ matchings on the same graph, keeping the most recent graph cached may reduce execution time.
(The manipulation necessary to re-use a graph incurs its own performance overhead, though, which may scale with graph
size faster than file I/O does. If so, caching is best for smaller graphs.)
When caching is active, subsequent uses of the same data file won't need to be read in again until another file of that type is used or generated,
files either generated or read from disk can be cached in program memory. When caching is active, subsequent uses of the
same data file won't need to be read in again until another file of that type is used or generated,
or caching is turned off for that file type. The program checks whether it needs to update its cached data by comparing
filenames as entered by the user. On encountering a new filename, the program flushes its cache and reads in the new file.
(Note that cached Graph/Data files must be transformed back into their original state after a matching experiment, which
may take some time. Whether file I/O or graph transformation takes longer for graph/data files is likely to be
device-specific.)
The program's caching behavior can be controlled in the Options menu. By default, all caching is OFF.
The program can optionally output Graph/Data files in .GraphML format (.graphml) for data portability. This can be
turned on in the Options menu. By default, GraphML output is OFF.
#### Cell Sample Files
Cell Sample files consist of any number of distinct "T cells." Every cell contains
four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
@@ -181,14 +196,19 @@ Options for creating a Graph/Data file:
* The Cell Sample file to use
* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
These files do not have a human-readable structure, and are not portable to other programs. (Export of graphs in a
portable data format may be implemented in the future. The tricky part is encoding the necessary metadata.)
These files do not have a human-readable structure, and are not portable to other programs.
(For portability to other software, turn on GraphML output in the Options menu. This will produce a .graphml file
for the weighted graph, with vertex attributes sequence, type, and occupancy data.)
---
#### Matching Results Files
Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a Graph and
Data file. Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a serialized
binary Graph/Data file (.ser). (Because .graphML files are larger than .ser files, BiGpairSEQ_Sim supports .graphML
output only. Graph/data input must use a serialized binary.)
Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
@@ -258,25 +278,26 @@ slightly less time than the simulation itself. Real elapsed time from start to f
* ~~*No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.*~~
* Might have figured out a way to do it, by taking edges out and then putting them back into the graph. This may actually be possible.
* It is possible, though the modifications to the graph incur their own performance penalties. Need testing to see which option is best.
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows.
* ~~Problem is variable number of cells in a well~~
* ~~Apache Commons CSV library writes entries a row at a time~~
* _Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily Thingiverse transposes CSV matrices in R, might not even be worth fixing._
* Re-implement command line arguments, to enable scripting and statistical simulation studies
* ~~Implement sample plates with random numbers of T cells per well.~~ DONE
* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well population sizes on a sample plate; pairSEQ is not.
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
* Enable GraphML output in addition to serialized object binaries, for data portability
* Custom vertex type with attribute for sequence occupancy?
* Re-implement CDR1 matching method
* Implement Duan and Su's maximum weight matching algorithm
* Add controllable algorithm-type parameter?
* ~~Test whether pairing heap (currently used) or Fibonacci heap is more efficient for priority queue in current matching algorithm~~ DONE
* ~~in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage~~
* ~~Add controllable heap-type parameter?~~
* Parameter implemented. For large graphs, Fibonacci heap wins. Now the new default.
* Parameter implemented. Fibonacci heap the current default.
* ~~Implement sample plates with random numbers of T cells per well.~~ DONE
* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well population sizes on a sample plate; pairSEQ is not.
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows.
* ~~Problem is variable number of cells in a well~~
* ~~Apache Commons CSV library writes entries a row at a time~~
* _Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily Thingiverse transposes CSV matrices in R, might not even be worth fixing.
* ~~Enable GraphML output in addition to serialized object binaries, for data portability~~ DONE
* ~~Custom vertex type with attribute for sequence occupancy?~~ABANDONED
* Have a branch where this is implemented, but there's a bug that broke matching. Don't currently have time to fix.
* Re-implement command line arguments, to enable scripting and statistical simulation studies
* Re-implement CDR1 matching method
* Implement Duan and Su's maximum weight matching algorithm
* Add controllable algorithm-type parameter?
* This would be fun and valuable, but probably take more time than I have for a hobby project.
## CITATIONS
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)

View File

@@ -1,6 +1,6 @@
import java.util.Random;
//main class. For choosing interface type and caching file data
//main class. For choosing interface type and holding settings
public class BiGpairSEQ {
private static final Random rand = new Random();
@@ -14,6 +14,8 @@ public class BiGpairSEQ {
private static boolean cachePlate = false;
private static boolean cacheGraph = false;
private static String priorityQueueHeapType = "FIBONACCI";
private static boolean outputBinary = true;
private static boolean outputGraphML = false;
public static void main(String[] args) {
if (args.length == 0) {
@@ -164,4 +166,11 @@ public class BiGpairSEQ {
public static void setFibonacciHeap() {
priorityQueueHeapType = "FIBONACCI";
}
public static boolean outputBinary() {return outputBinary;}
public static void setOutputBinary(boolean b) {outputBinary = b;}
public static boolean outputGraphML() {return outputGraphML;}
public static void setOutputGraphML(boolean b) {outputGraphML = b;}
}

View File

@@ -1,10 +1,37 @@
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;
public class CellSample {
private List<Integer[]> cells;
private Integer cdr1Freq;
public CellSample(Integer numDistinctCells, Integer cdr1Freq){
this.cdr1Freq = cdr1Freq;
List<Integer> numbersCDR3 = new ArrayList<>();
List<Integer> numbersCDR1 = new ArrayList<>();
Integer numDistCDR3s = 2 * numDistinctCells + 1;
IntStream.range(1, numDistCDR3s + 1).forEach(i -> numbersCDR3.add(i));
IntStream.range(numDistCDR3s + 1, numDistCDR3s + 1 + (numDistCDR3s / cdr1Freq) + 1).forEach(i -> numbersCDR1.add(i));
Collections.shuffle(numbersCDR3);
Collections.shuffle(numbersCDR1);
//Each cell represented by 4 values
//two CDR3s, and two CDR1s. First two values are CDR3s (alpha, beta), second two are CDR1s (alpha, beta)
List<Integer[]> distinctCells = new ArrayList<>();
for(int i = 0; i < numbersCDR3.size() - 1; i = i + 2){
Integer tmpCDR3a = numbersCDR3.get(i);
Integer tmpCDR3b = numbersCDR3.get(i+1);
Integer tmpCDR1a = numbersCDR1.get(i % numbersCDR1.size());
Integer tmpCDR1b = numbersCDR1.get((i+1) % numbersCDR1.size());
Integer[] tmp = {tmpCDR3a, tmpCDR3b, tmpCDR1a, tmpCDR1b};
distinctCells.add(tmp);
}
this.cells = distinctCells;
}
public CellSample(List<Integer[]> cells, Integer cdr1Freq){
this.cells = cells;
this.cdr1Freq = cdr1Freq;

View File

@@ -288,7 +288,7 @@ public class CommandLineInterface {
//for calling from command line
public static void makeCells(String filename, Integer numCells, Integer cdr1Freq){
CellSample sample = Simulator.generateCellSample(numCells, cdr1Freq);
CellSample sample = new CellSample(numCells, cdr1Freq);
CellFileWriter writer = new CellFileWriter(filename, sample);
writer.writeCellsToFile();
}

View File

@@ -1,35 +0,0 @@
import org.jgrapht.graph.SimpleWeightedGraph;
import org.jgrapht.nio.graphml.GraphMLImporter;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
public class GraphMLFileReader {
private String filename;
private SimpleWeightedGraph graph;
public GraphMLFileReader(String filename, SimpleWeightedGraph graph) {
if(!filename.matches(".*\\.graphml")){
filename = filename + ".graphml";
}
this.filename = filename;
this.graph = graph;
try(//don't need to close reader bc of try-with-resources auto-closing
BufferedReader reader = Files.newBufferedReader(Path.of(filename));
){
GraphMLImporter<SimpleWeightedGraph, BufferedReader> importer = new GraphMLImporter<>();
importer.importGraph(graph, reader);
}
catch (IOException ex) {
System.out.println("Graph file " + filename + " not found.");
System.err.println(ex);
}
}
public SimpleWeightedGraph getGraph() { return graph; }
}

View File

@@ -1,4 +1,8 @@
import org.jgrapht.graph.DefaultWeightedEdge;
import org.jgrapht.graph.SimpleWeightedGraph;
import org.jgrapht.nio.Attribute;
import org.jgrapht.nio.AttributeType;
import org.jgrapht.nio.DefaultAttribute;
import org.jgrapht.nio.dot.DOTExporter;
import org.jgrapht.nio.graphml.GraphMLExporter;
@@ -7,25 +11,69 @@ import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
public class GraphMLFileWriter {
String filename;
SimpleWeightedGraph graph;
GraphWithMapData data;
public GraphMLFileWriter(String filename, SimpleWeightedGraph graph) {
public GraphMLFileWriter(String filename, GraphWithMapData data) {
if(!filename.matches(".*\\.graphml")){
filename = filename + ".graphml";
}
this.filename = filename;
this.graph = graph;
this.data = data;
}
// public void writeGraphToFile() {
// try(BufferedWriter writer = Files.newBufferedWriter(Path.of(filename), StandardOpenOption.CREATE_NEW);
// ){
// GraphMLExporter<SimpleWeightedGraph, BufferedWriter> exporter = new GraphMLExporter<>();
// exporter.exportGraph(graph, writer);
// } catch(IOException ex){
// System.out.println("Could not make new file named "+filename);
// System.err.println(ex);
// }
// }
public void writeGraphToFile() {
SimpleWeightedGraph graph = data.getGraph();
Map<Integer, Integer> vertexToAlphaMap = data.getPlateVtoAMap();
Map<Integer, Integer> vertexToBetaMap = data.getPlateVtoBMap();
Map<Integer, Integer> alphaOccs = data.getAlphaWellCounts();
Map<Integer, Integer> betaOccs = data.getBetaWellCounts();
try(BufferedWriter writer = Files.newBufferedWriter(Path.of(filename), StandardOpenOption.CREATE_NEW);
){
GraphMLExporter<SimpleWeightedGraph, BufferedWriter> exporter = new GraphMLExporter<>();
//create exporter. Let the vertex labels be the unique ids for the vertices
GraphMLExporter<Integer, SimpleWeightedGraph<Vertex, DefaultWeightedEdge>> exporter = new GraphMLExporter<>(v -> v.toString());
//set to export weights
exporter.setExportEdgeWeights(true);
//set type, sequence, and occupancy attributes for each vertex
exporter.setVertexAttributeProvider( v -> {
Map<String, Attribute> attributes = new HashMap<>();
if(vertexToAlphaMap.containsKey(v)) {
attributes.put("type", DefaultAttribute.createAttribute("CDR3 Alpha"));
attributes.put("sequence", DefaultAttribute.createAttribute(vertexToAlphaMap.get(v)));
attributes.put("occupancy", DefaultAttribute.createAttribute(
alphaOccs.get(vertexToAlphaMap.get(v))));
}
else if(vertexToBetaMap.containsKey(v)) {
attributes.put("type", DefaultAttribute.createAttribute("CDR3 Beta"));
attributes.put("sequence", DefaultAttribute.createAttribute(vertexToBetaMap.get(v)));
attributes.put("occupancy", DefaultAttribute.createAttribute(
betaOccs.get(vertexToBetaMap.get(v))));
}
return attributes;
});
//register the attributes
exporter.registerAttribute("type", GraphMLExporter.AttributeCategory.NODE, AttributeType.STRING);
exporter.registerAttribute("sequence", GraphMLExporter.AttributeCategory.NODE, AttributeType.STRING);
exporter.registerAttribute("occupancy", GraphMLExporter.AttributeCategory.NODE, AttributeType.STRING);
//export the graph
exporter.exportGraph(graph, writer);
} catch(IOException ex){
System.out.println("Could not make new file named "+filename);
@@ -33,3 +81,4 @@ public class GraphMLFileWriter {
}
}
}

View File

@@ -4,7 +4,6 @@ import org.jgrapht.graph.SimpleWeightedGraph;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
public interface GraphModificationFunctions {

View File

@@ -74,7 +74,7 @@ public class InteractiveInterface {
System.out.println(ex);
sc.next();
}
CellSample sample = Simulator.generateCellSample(numCells, cdr1Freq);
CellSample sample = new CellSample(numCells, cdr1Freq);
assert filename != null;
System.out.println("Writing cells to file");
CellFileWriter writer = new CellFileWriter(filename, sample);
@@ -252,7 +252,6 @@ public class InteractiveInterface {
String filename = null;
String cellFile = null;
String plateFile = null;
try {
String str = "\nGenerating bipartite weighted graph encoding occupancy overlap data ";
str = str.concat("\nrequires a cell sample file and a sample plate file.");
@@ -310,9 +309,16 @@ public class InteractiveInterface {
List<Integer[]> cells = cellSample.getCells();
GraphWithMapData data = Simulator.makeGraph(cells, plate, true);
assert filename != null;
GraphDataObjectWriter dataWriter = new GraphDataObjectWriter(filename, data);
dataWriter.writeDataToFile();
System.out.println("Graph and Data file written to: " + filename);
if(BiGpairSEQ.outputBinary()) {
GraphDataObjectWriter dataWriter = new GraphDataObjectWriter(filename, data);
dataWriter.writeDataToFile();
System.out.println("Serialized binary graph/data file written to: " + filename);
}
if(BiGpairSEQ.outputGraphML()) {
GraphMLFileWriter graphMLWriter = new GraphMLFileWriter(filename, data);
graphMLWriter.writeGraphToFile();
System.out.println("GraphML file written to: " + filename);
}
if(BiGpairSEQ.cacheGraph()) {
BiGpairSEQ.setGraphInMemory(data, filename);
@@ -500,7 +506,9 @@ public class InteractiveInterface {
System.out.println("1) Turn " + getOnOff(!BiGpairSEQ.cacheCells()) + " cell sample file caching");
System.out.println("2) Turn " + getOnOff(!BiGpairSEQ.cachePlate()) + " plate file caching");
System.out.println("3) Turn " + getOnOff(!BiGpairSEQ.cacheGraph()) + " graph/data file caching");
System.out.println("4) Maximum weight matching algorithm options");
System.out.println("4) Turn " + getOnOff(!BiGpairSEQ.outputBinary()) + " serialized binary graph output");
System.out.println("5) Turn " + getOnOff(!BiGpairSEQ.outputGraphML()) + " GraphML graph output");
System.out.println("6) Maximum weight matching algorithm options");
System.out.println("0) Return to main menu");
try {
input = sc.nextInt();
@@ -508,7 +516,9 @@ public class InteractiveInterface {
case 1 -> BiGpairSEQ.setCacheCells(!BiGpairSEQ.cacheCells());
case 2 -> BiGpairSEQ.setCachePlate(!BiGpairSEQ.cachePlate());
case 3 -> BiGpairSEQ.setCacheGraph(!BiGpairSEQ.cacheGraph());
case 4 -> algorithmOptions();
case 4 -> BiGpairSEQ.setOutputBinary(!BiGpairSEQ.outputBinary());
case 5 -> BiGpairSEQ.setOutputGraphML(!BiGpairSEQ.outputGraphML());
case 6 -> algorithmOptions();
case 0 -> backToMain = true;
default -> System.out.println("Invalid input");
}

View File

@@ -23,30 +23,6 @@ public class Simulator implements GraphModificationFunctions {
private static final int cdr1AlphaIndex = 2;
private static final int cdr1BetaIndex = 3;
public static CellSample generateCellSample(Integer numDistinctCells, Integer cdr1Freq) {
//In real T cells, CDR1s have about one third the diversity of CDR3s
List<Integer> numbersCDR3 = new ArrayList<>();
List<Integer> numbersCDR1 = new ArrayList<>();
Integer numDistCDR3s = 2 * numDistinctCells + 1;
IntStream.range(1, numDistCDR3s + 1).forEach(i -> numbersCDR3.add(i));
IntStream.range(numDistCDR3s + 1, numDistCDR3s + 1 + (numDistCDR3s / cdr1Freq) + 1).forEach(i -> numbersCDR1.add(i));
Collections.shuffle(numbersCDR3);
Collections.shuffle(numbersCDR1);
//Each cell represented by 4 values
//two CDR3s, and two CDR1s. First two values are CDR3s (alpha, beta), second two are CDR1s (alpha, beta)
List<Integer[]> distinctCells = new ArrayList<>();
for(int i = 0; i < numbersCDR3.size() - 1; i = i + 2){
Integer tmpCDR3a = numbersCDR3.get(i);
Integer tmpCDR3b = numbersCDR3.get(i+1);
Integer tmpCDR1a = numbersCDR1.get(i % numbersCDR1.size());
Integer tmpCDR1b = numbersCDR1.get((i+1) % numbersCDR1.size());
Integer[] tmp = {tmpCDR3a, tmpCDR3b, tmpCDR1a, tmpCDR1b};
distinctCells.add(tmp);
}
return new CellSample(distinctCells, cdr1Freq);
}
//Make the graph needed for matching CDR3s
public static GraphWithMapData makeGraph(List<Integer[]> distinctCells, Plate samplePlate, boolean verbose) {
Instant start = Instant.now();
@@ -260,6 +236,7 @@ public class Simulator implements GraphModificationFunctions {
}
//Metadata comments for CSV file
String algoType = "LEDA book with heap: " + heapType;
int min = Math.min(alphaCount, betaCount);
//rate of attempted matching
double attemptRate = (double) (trueCount + falseCount) / min;
@@ -290,6 +267,7 @@ public class Simulator implements GraphModificationFunctions {
Map<String, String> metadata = new LinkedHashMap<>();
metadata.put("sample plate filename", data.getSourceFilename());
metadata.put("graph filename", dataFilename);
metadata.put("algorithm type", algoType);
metadata.put("well populations", wellPopulationsString);
metadata.put("total alphas found", alphaCount.toString());
metadata.put("total betas found", betaCount.toString());
@@ -690,7 +668,7 @@ public class Simulator implements GraphModificationFunctions {
private static Map<Integer, Integer> makeVertexToSequenceMap(Map<Integer, Integer> sequences, Integer startValue) {
Map<Integer, Integer> map = new LinkedHashMap<>(); //LinkedHashMap to preserve order of entry
Integer index = startValue;
Integer index = startValue; //is this necessary? I don't think I use this.
for (Integer k: sequences.keySet()) {
map.put(index, k);
index++;

View File

@@ -1,14 +1,20 @@
public class Vertex {
private final Integer peptide;
private final Integer vertexLabel;
private final Integer sequence;
private final Integer occupancy;
public Vertex(Integer peptide, Integer occupancy) {
this.peptide = peptide;
public Vertex(Integer vertexLabel, Integer sequence, Integer occupancy) {
this.vertexLabel = vertexLabel;
this.sequence = sequence;
this.occupancy = occupancy;
}
public Integer getPeptide() {
return peptide;
public Integer getVertexLabel() { return vertexLabel; }
public Integer getSequence() {
return sequence;
}
public Integer getOccupancy() {