[SOLVED] Fast way to get different lines from 2 big files in Java

Issue

In Java, I have a method that reads two files, with each line being a GUID. The lines are unordered. The output is two new files with the lines that appear only on each file.

Example files:

| Input_1 | Input_2 |         | Output_1 | Output_2 |
| ------- | ------- |         | -------- | -------- |
| abcdef  | uvwxyz  |    >    | mnopqr   | uvwxyz   |
| ghijkl  | ghijkl  |
| mnopqr  | abcdef  |

I managed to do it fine with one Collection<String> for each file and some addAll() + removeAll() shenanigans, however the files are growing in size and this whole thing is taking some time now. Each file has about 600k lines.

Is there a fast way to improve this code just using another type of collection or I need to refactor my way of doing?

Code in question:

//Read two files
Collection<String> guidFile1 = readFileGuid(pathFile1);
Collection<String> guidFile2 = readFileGuid(pathFile2);

//Add file1 and remove file2
Collection<String> leftFromFile1 = new ArrayList<String>();
leftFromFile1.addAll(guidFile1);
leftFromFile1.removeAll(guidFile2);

//Add file2 and remove file1
Collection<String> leftFromFile2 = new ArrayList<String>();
leftFromFile2.addAll(guidFile2);
leftFromFile2.removeAll(guidFile1);

//Outputs
System.out.println("Leftover from file1: " + leftFromFile1.size());
System.out.println("Leftover from file2: " + leftFromFile2.size());

Solution

In your code, the removeAll is the most expensive operation. If both files are 100 lines long, then removeAll would perform 20,000x operations. If you include addAll, then it would perform a total of 20,200 operations. What you want is 200 operations. The way to do that is to use a Set.

  static HashSet<String> readFileGuid(String path) throws IOException{
      HashSet<String> guidFile = new HashSet<>();
      Scanner s = new Scanner(new File(path));
      while(s.hasNextLine()) 
        guidFile.add(s.nextLine());
      s.close();
      return guidFile;
  }
  
  static List<String> subtract(HashSet<String> s1, HashSet<String> s2){
      List<String> result = new ArrayList<>();
      Iterator<String> it = s1.iterator();
      while(it.hasNext()){
          String item = it.next();
          if (!s2.contains(item))
            result.add(item);
      }
      return result;
  }
  
  public static void main (String[]args) throws IOException{
      HashSet<String> guidFile1 = readFileGuid("input1.txt");
      HashSet<String> guidFile2 = readFileGuid("input2.txt");
      
      List<String> leftFromFile1 = subtract(guidFile1, guidFile2);
      List<String> leftFromFile2 = subtract(guidFile2, guidFile1);
      
      System.out.println("file1:" + leftFromFile1);
      System.out.println("file2:" + leftFromFile2);
  }

Answered By – Cheng Thao

Answer Checked By – Pedro (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *