[SOLVED] Performance of "left join" in F#

Issue

I’m essentially trying to perform an sql left join in F# taking in two csv-files and producing a third. My files are not huge (around 200k rows) but I still experience horrific performance – in fact it is faster using vlookup in xl…

The csv’s both have an identifier "column" which holds compatible values but a value present in one csv is not guaranteed to be in the other.

I have tinkered with it, and I suspect the search through one csv for every row in the other is the killer.

EDIT: Replacing Array with Map improves performance a lot. But I guess this can still be improved further

Any ideas for improving this?

Some (pseudo) code:

open FSharp.Data

type DataLeft = 
    CsvProvider<Sample = "identifier;var1;var2", AssumeMissingValues = true, Schema = "identifier (string), var1, var2", Separators=";", HasHeaders=true, Encoding="UTF-8">

type DataRight = 
    CsvProvider<Sample = "identifier;var3;var4", AssumeMissingValues = true, Schema = "identifier (string), var3 (float option), var4 (float option)", Separators=";", HasHeaders=true, Encoding="UTF-8">

type Output =
    CsvProvider<Sample = "identifier;var1;var2;var3;var4", AssumeMissingValues = true, Schema = "identifier (string), var1, var2, var3 (float option), var4 (float option)", Separators=";", HasHeaders=true, Encoding="UTF-8">

let leftRows = DataLeft.Load(leftPath).Rows

// (slightly) more efficient to convert to array
let rightRows = DataRight.Load(rightPath).Rows |> Seq.toArray
**EDIT: let rightRows = DataRight.Load(rightPath).Rows |> Seq.map (fun row -> (row.Identifier, row)) |> Map.ofSeq**

let getMissingVars (row : DataLeft.Row) =
    let id = row.Identifier
    let rightRow = rightRows |> Array.tryFind (fun rRow -> rRow.Identifier = id)
**EDIT: let rightRow = rightRows.TryFind(id)**
    match rightRow with
    | None ->
        Output.Row(
            id,
            row.Var1,
            row.Var2,
            None,
            None)
    | Some realRow -> 
        Output.Row(
            id,
            row.Var1,
            row.Var2,
            realRow.Var3,
            realRow.Var4)

let rows = leftRows |> Seq.map getMissingVars

let csv = new Output(rows)
csv.Save(path = "outputPath")

Solution

Simply creating a dictionary for lookups solved the issue. I gave up trying to improve further, so posting this as answer.

As per edit replace

let rightRows = DataRight.Load(rightPath).Rows |> Seq.toArray

with

let rightRows = 
    DataRight.Load(rightPath).Rows 
    |> Seq.map (fun row -> (row.Identifier, row)) 
    |> Map.ofSeq

or some nicer dictionary. Then replace Array.tryFind with Map.tryFind

Answered By – SAOBab00n

Answer Checked By – Cary Denson (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *