AppliedFSharp


Pipeline walkthrough

The pipeline used in the Pipeline.DefaultConfig.runDefaultPipeline function follows the steps I used while scripting this tool. The original script can be found here (it's a bit messy, don't blame me). The basic steps are:

  1. Creation of all possible primer pairs of length n flanking a template of length m for the input cDNA/gene
  2. Creating a blast search database from the cDNA source/transcriptome/genome using the Blast BioContainer
  3. Blasting all primer pairs against the search database using the Blast BioContainer
  4. Parsing the blast results into a Deedle frame to handle the grouping and filtering steps of the data
  5. Calculating self hybridization/internal loop/fwd-rev primer hybridization energy using the IntaRNA BioContainer

I added an example dataset in the form of Chlamydomonas reinhardtii cDNA here.

To test the pipeline on this dataset, you can use the following script:

open AppliedFSharp
open BioFSharp
open BioFSharp.BioTools
open BioFSharp.IO



//Docker client pipe
let client = Docker.connect "npipe://./pipe/docker_engine"

//IntaRNA image name. This can differ on your machine.
let IntaRNAImage =  Docker.ImageName @"quay.io/biocontainers/intarna:2.4.1--pl526hfac12b2_0"

//Keep the IntaRNA container up and running. The mounted path must be absolute, so change it unless your name is Kevin ;)
let intaRNAContext = 
    BioContainer.initBcContextWithMountAsync client IntaRNAImage @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data"
    |> Async.RunSynchronously

//Blast image name. This can differ on your machine.
let ImageBlast = Docker.DockerId.ImageId "blast"

//Keep the Blast container up and running. The mounted path must be absolute, so change it unless your name is Kevin ;)
let blastContext = 
    BioContainer.initBcContextWithMountAsync client ImageBlast @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data"
    |> Async.RunSynchronously


//Arbitrary cDNA from the cDNA pool
let testGene = 
    FastA.fromFile (BioArray.ofNucleotideString)(__SOURCE_DIRECTORY__ +  @"..\..\..\..\docsrc\content\data\Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.cdna.all.fa")
    |> Seq.item 1337

let testResult = 
    Pipeline.DefaultConfig.runDefaultPipeline 
        // Primer generation parameters: primer length (20) and template length (100)
        20 100 
        blastContext 
        // Path of the cDNA source used as blast database, followed by the paths for saving outputs. As these are mounted into the containers, use absolute paths
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.cdna.all.fa"
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\PipeLineQueryTest.fasta"
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\PipeLineTestBlastOutput.fasta"
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\PipeLineTestBlastOutputCleaned.fasta"
        intaRNAContext
        testGene
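
Since every path handed to the pipeline has to be absolute, it can help to derive all of them from one data directory instead of hardcoding them. The following is only a sketch: `inDataDir` and the `data` folder location are hypothetical names chosen for illustration and are not part of AppliedFSharp.

```fsharp
open System.IO

// Hypothetical helper (not part of AppliedFSharp): resolve a file name
// inside a data directory to an absolute, normalized path.
let inDataDir (dataDir: string) (fileName: string) =
    Path.GetFullPath(Path.Combine(dataDir, fileName))

// Assumption: the data folder sits next to this script; adjust to your checkout.
let dataDir = Path.Combine(__SOURCE_DIRECTORY__, "data")

// The four file paths expected by runDefaultPipeline, all rooted in dataDir.
let dbPath          = inDataDir dataDir "Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.cdna.all.fa"
let queryPath       = inDataDir dataDir "PipeLineQueryTest.fasta"
let blastOutPath    = inDataDir dataDir "PipeLineTestBlastOutput.fasta"
let cleanedOutPath  = inDataDir dataDir "PipeLineTestBlastOutputCleaned.fasta"
```

The same `dataDir` would then also be the directory mounted into both containers, so the paths stay consistent between the host and the tools.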

Step by step

Creation of all possible primer pairs of length n flanking a template of length m for the input cDNA/gene

The generatePrimerPairs function creates these primer pairs by moving a sliding window of size 2*n+m over the sequence and taking the two flanking regions of length n from each window.

let testPairs = Pipeline.generatePrimerPairs 10 100 testGene
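
To make the windowing idea concrete, here is a minimal sketch on plain strings. `primerPairsSketch` is a hypothetical helper, not the library's implementation: it ignores BioFSharp's nucleotide types and simply returns the two flanking substrings of each window. Whether the real function additionally reverse-complements the second primer is not covered here.

```fsharp
// Hypothetical sketch, not the AppliedFSharp implementation:
// slide a window of size 2*n+m over the sequence and keep the two
// flanking regions of length n as a (forward, reverse) pair.
let primerPairsSketch (n: int) (m: int) (sequence: string) =
    sequence
    |> Seq.windowed (2 * n + m)
    |> Seq.map (fun window ->
        let fwd = System.String(window.[0 .. n - 1])   // first n characters
        let rev = System.String(window.[n + m ..])     // last n characters
        fwd, rev)
    |> Seq.toList

// A 12-character toy sequence with n = 3 and m = 4 gives 12 - 10 + 1 = 3 windows.
let pairs = primerPairsSketch 3 4 "ACGTACGTACGT"
```

A sequence of length L yields L - (2*n+m) + 1 pairs, so the number of candidates grows linearly with the length of the input cDNA.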

Creating a blast search database from the cDNA source/transcriptome/genome using the Blast BioContainer

The preparePrimerBlastSearch function prepares a blast database for the subsequent blast searches. For the most meaningful features, use the full cDNA transcriptome or genome of the organism.

let _ = 
    Pipeline.preparePrimerBlastSearch 
        blastContext 
        // verbatim string (@), otherwise the backslashes in the Windows path are read as escape sequences
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.cdna.all.fa"
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\PipeLineQueryTest.fasta"
        testPairs

Blasting all primer pairs against the search database using the Blast BioContainer

The blastPrimerPairs function blasts all generated primer pairs against the previously generated database. The results are written to a file of your choice.

let _ = 
    Pipeline.blastPrimerPairs
        blastContext
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.cdna.all.fa"
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\PipeLineQueryTest.fasta"
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\PipeLineTestBlastOutput.fasta"
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\PipeLineTestBlastOutputCleaned.fasta"

Parsing the blast results in a deedle frame to handle grouping and filtering steps of the data

Calculating self hybridization/internal Loop/fwd-rev primer hybridization energy using the IntaRNA BioContainer

Both steps are handled by the getResultFrame function, which parses the blast results, calculates hybridization energy features for them, and groups them by query id and direction (fwd/rev).

The result of this function is a frame that contains the features for all primer pairs.

let result2 =
    Pipeline.getResultFrame 
        intaRNAContext 
        @"C:\Users\Kevin\source\repos\AppliedFSharp\docsrc\content\data\PipeLineTestBlastOutputCleaned.fasta"
        //This converter extracts the query id from the fasta headers. It is specific to my headers, so you may need a different converting function here.
        (fun (x:string) -> x.Split(' ').[0].Trim())
        testPairs
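
The header converter and the grouping by query id and direction can be illustrated without Docker or Deedle. The sketch below reuses the converter from the call above; the `Hit` record and the example headers are hypothetical stand-ins for parsed blast results, not the structure the pipeline actually uses.

```fsharp
// Same converter as passed to getResultFrame: keep the text before the first space.
let headerToQueryId (header: string) = header.Split(' ').[0].Trim()

// Hypothetical record standing in for one parsed blast hit;
// the real pipeline keeps its results in a Deedle frame.
type Hit = { Header: string; Direction: string; Subject: string }

// Made-up example hits for illustration only.
let hits =
    [ { Header = "gene1_0 some description";  Direction = "fwd"; Subject = "A" }
      { Header = "gene1_0 some description";  Direction = "rev"; Subject = "B" }
      { Header = "gene1_0 some description";  Direction = "fwd"; Subject = "C" }
      { Header = "gene1_1 other description"; Direction = "fwd"; Subject = "A" } ]

// Count the hits per (query id, direction) key.
let hitCounts =
    hits
    |> List.countBy (fun h -> headerToQueryId h.Header, h.Direction)
```

If your fasta headers put the id somewhere else, only `headerToQueryId` needs to change; the grouping key stays the same.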

Short Conclusion

While this may not be the flashiest algorithm, I think this post highlights the strengths of F# in data science pretty well. In a little more than 3 days, I was able to predict oligonucleotide interactions, blast sequences against genomes, and group the results in a safe and visually accessible way during the exploratory data analysis.

Furthermore, the script was easily transferable to .fs files and could therefore be compiled as a library in no time. I think F# has great applications in research, and my group and I will continue to use it for all kinds of (bioinformatic) workflows.
