Notes from the field on ColdFusion (or related) technical issues.

Friday, January 16, 2009

ColdFusion: Fast Query to File Export


A semi-common data processing task is to take the result of a query and write it to a file. People usually do this in ColdFusion something like this:


<cfsetting requesttimeout="1200" enablecfoutputonly="yes">
<cfset tab = chr(9)>

<cfquery datasource="Northwind" name="qProducts">
  SELECT p1.ProductID, p1.ProductName, case when p2.productname> 'B' then p2.productName else null end productName2
  FROM Products p1
  CROSS JOIN Products p2
  CROSS JOIN Products p3
  CROSS JOIN Products p4
  WHERE p1.ProductName <> 'foo'
  ORDER BY p1.ProductID
</cfquery>

<cfloop query="qProducts">
  <cfset line = ProductID
        & tab & ProductName
        & tab & ProductName2>
  <cffile action="append"
        file="/out.txt"
        output="#line#"
        addNewLine="yes"
    >
</cfloop>
<cfsetting enablecfoutputonly="no">


This works fine for small result sets. In my case, however, I Cartesian-joined the Northwind "Products" table to itself a few times, so that this query returns 35 million rows; the page above runs for ~3 minutes on my laptop and then dies with a java.lang.OutOfMemoryError.

I'd previously written a set of ColdFusion tags that could access JDBC directly from ColdFusion and process the resultset one row at a time, thereby avoiding out-of-memory errors:


<cfsetting requesttimeout="1200" enablecfoutputonly="yes">
<cfset tab = chr(9)>

<!--- cf_jdbcQuery will return a java ResultSet object instead of a ColdFusion Query object --->
<cf_jdbcquery datasource="Northwind" name="qProducts"><cfoutput>
  SELECT p1.ProductID, p1.ProductName, case when p2.productname> 'B' then p2.productName else null end productName2
  FROM Products p1
  CROSS JOIN Products p2
  CROSS JOIN Products p3
  CROSS JOIN Products p4
  WHERE p1.ProductName <> 'foo'
  ORDER BY p1.ProductID
</cfoutput></cf_jdbcquery>

<cf_jdbcqueryloop query="qProducts">
  <cfset line = qProducts.ProductID
        & tab & qProducts.ProductName
        & tab & qProducts.ProductName2>
  <cffile action="append"
        file="/out.txt"
        output="#line#"
        addNewLine="yes"
    >
</cf_jdbcqueryloop>
<cfsetting enablecfoutputonly="no">


This ran for 20mins (until it hit the timeout),writing about 37MB to the output file. Still... 20 minutes? I think we can do better.

So, I wrote a Java class to do this in a very small loop, to see the difference in speed. The Java source file ("FastResultsetToFile.java") is this:


import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

/**
* @author Daryl Banttari
* Released to the public domain.  Do with it what you will!
*/
public class FastResultsetToFile {

  public static final byte[] CRLF = { 13, 10 };

  /**
   * Very quickly sends data from an any size query to an output file.
   *
   * @throws IOException
   * @throws SQLException
   */
  public static int exportResultsetToFile(ResultSet rs, String fileName, String delim, String charset) throws IOException, SQLException {
      int rowcount = 0;
      File file = new File(fileName);
      byte[] delimBytes = delim.getBytes(charset);
      byte[] CRLF = "\r\n".getBytes(charset);
      BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(file));
      try {
          ResultSetMetaData rsmd = rs.getMetaData();
          int numCols = rsmd.getColumnCount();
          // iterate over resultset
          while (rs.next()) {
              // write each column
              for (int i = 1; i <= numCols; ++i) {
                  String s = rs.getString(i);
                  if (s != null) {
                      out.write(s.getBytes(charset));
                  }
                  if (i == numCols) {
                      // end of row, write CRLF
                      out.write(CRLF);
                  }
                  else {
                      // write delimiter between data
                      out.write(delimBytes);
                  }
              }
              rowcount++;
          }
      }
      finally {
          if (out != null) {
              out.flush();
              out.close();
          }
          if (rs != null) {
              rs.close();
          }
      }
      return rowcount;
  }

}


To compile this, save the FastResultsetToFile.java to disk, then type "javac FastResultsetToFile.java". This will produce a file called FastResultsetToFile.class. Depending on how you have ColdFusion installed, the file should be placed in either:

\coldfusionmx\wwwroot\WEB-INF\classes

or

\JRun4\servers\cfusion\cfusion-ear\cfusion-war\WEB-INF\classes

The ColdFusion page looks like this:


<!--- cf_jdbcQuery will return a java ResultSet object instead of a ColdFusion Query object --->
<cf_jdbcQuery datasource="Northwind" name="qProducts">
  SELECT p1.ProductID, p1.ProductName, case when p2.productname> 'B' then p2.productName else null end productName2
  FROM Products p1
  CROSS JOIN Products p2
  CROSS JOIN Products p3
  CROSS JOIN Products p4
  WHERE p1.ProductName <> 'foo'
  ORDER BY p1.ProductID
</cf_jdbcQuery>

<cfset application.fastFileWriter = createObject("java", "FastResultsetToFile")>
<cfset application.fastFileWriter.exportResultsetToFile(qProducts, "/out.txt", chr(9), "UTF-8")>

Done.


This completes, writing all 35,153,041 rows into a 1.3 GB file in about 5 minutes. Not bad! If we'd let cf_jdbcQueryLoop complete, it'd have taken nearly 12 hours (based on time and file size ratio).

While you can do a lot right inside ColdFusion, it's important to remember that it's not the right tool for every job.

8 comments:

  1. Crazy question,

    Why are you writing to the disk after every loop?

    ReplyDelete
  2. Great tidbit post here Daryl. Would it be faster if you used a Java StringBuffer to build a few hundred rows before writing to disk? I think it would be faster to do that versus writing to disk on each column value. It would be an interesting follow up to this blog post to explore that.

    ReplyDelete
  3. I'm using a BufferedOutputStream which serves the same purpose as your StringBuffer suggestion would, but at a lower layer.

    @Hem,

    The Java example is not writing to disk after every loop because it's using the BufferedOutputStream; the ColdFusion example is writing on every iteration for the sake of code clarity. If I were buffering output, it'd be noticeably faster, but still not nearly as fast as the Java example.

    ReplyDelete
  4. Daryl, I should have noticed that you used the BufferedOutputStream -- I skipped right over it. Do you think it would be more efficient if the buffer size was set higher than 512 bytes which is the default if you don't pass one in the constructor? I know that the flush() method can be a little expensive and I assume that is being called each time internally when the buffer hits the threshold.

    ReplyDelete
  5. I was playing with your jdbcQuery tags, and I noticed in the jdbcQueryTest.cfm file you sleep for 4 seconds, and I was wondering why you were doing that.

    ReplyDelete
  6. One other thing for you...I'm having an issue with connections not closing.

    I see that you're calling cnx.close() in a cfcatch, but I don't see it being called anywhere else. Just thought I'd let you know.

    ReplyDelete
  7. This comment has been removed by a blog administrator.

    ReplyDelete