Reading a list of Files as a Java 8 Stream

Question

I have a (possibly long) list of binary files that I want to read lazily. There will be too many files to load into memory. I'm currently reading them as a MappedByteBuffer with FileChannel.map(), but that probably isn't required. I want the method readBinaryFiles(...) to return a Java 8 Stream so I can lazy load the list of files as I access them.

    public List<FileDataMetaData> readBinaryFiles(
            List<File> files,
            int numDataPoints,
            int dataPacketSize) throws IOException {

        List<FileDataMetaData> fmdList = new ArrayList<FileDataMetaData>();

        IOException lastException = null;
        for (File f: files) {

            try {
                FileDataMetaData fmd = readRawFile(f, numDataPoints, dataPacketSize);
                fmdList.add(fmd);
            } catch (IOException e) {
                logger.error("", e);
                lastException = e;
            }
        }

        if (null != lastException)
            throw lastException;

        return fmdList;
    }


    //  The List<DataPacket> returned will be in the same order as in the file.
    public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize) throws IOException {

        FileDataMetaData fmd;
        FileChannel fileChannel = null;
        try {
            fileChannel = new RandomAccessFile(file, "r").getChannel();
            long fileSz = fileChannel.size();
            ByteBuffer bbRead = ByteBuffer.allocate((int) fileSz);
            MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSz);

            buffer.get(bbRead.array());
            List<DataPacket> dataPacketList = new ArrayList<DataPacket>();

            while (bbRead.hasRemaining()) {

                int channelId = bbRead.getInt();
                long timestamp = bbRead.getLong();
                int[] data = new int[numDataPoints];
                for (int i=0; i<numDataPoints; i++)
                    data[i] = bbRead.getInt();

                DataPacket dp = new DataPacket(channelId, timestamp, data);
                dataPacketList.add(dp);
            }

            fmd = new FileDataMetaData(file.getCanonicalPath(), fileSz, dataPacketList);

        } catch (IOException e) {
            logger.error("", e);
            throw e;
        } finally {
            if (null != fileChannel) {
                try {
                    fileChannel.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        return fmd;
    }

Returning fmdList.stream() from readBinaryFiles(...) won't accomplish this, because by then the file contents will already have been read into memory, which is exactly what I can't do.

The other approaches to reading the contents of multiple files as a Stream rely on using Files.lines(), but I need to read binary files.

I'm open to doing this in Scala or Go if those languages have better support for this use case than Java.

I'd appreciate any pointers on how to read the contents of multiple binary files lazily.


java | java-stream | binaryfiles | lazy-loading    2016-09-09 16:09

Answers (4)

  1. 2016-09-09 16:09

    I don't know how performant this is, but you can use java.io.SequenceInputStream wrapped in a DataInputStream. This will effectively concatenate your files together. If you create a BufferedInputStream from each file, then the whole thing should be properly buffered.
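    As a minimal sketch of this idea (the temp-file setup and the one-int-per-file payload are invented for the demo; the class and method names are placeholders):

```java
import java.io.*;
import java.util.*;

public class ConcatDemo {

    // Concatenate the given files into one DataInputStream; each file
    // gets its own BufferedInputStream so reads stay buffered.
    static DataInputStream openAll(List<File> files) throws IOException {
        List<InputStream> streams = new ArrayList<>();
        for (File f : files)
            streams.add(new BufferedInputStream(new FileInputStream(f)));
        return new DataInputStream(
                new SequenceInputStream(Collections.enumeration(streams)));
    }

    public static void main(String[] args) throws IOException {
        // Two throwaway binary files, each holding a single int
        File a = File.createTempFile("demo", ".bin");
        File b = File.createTempFile("demo", ".bin");
        a.deleteOnExit();
        b.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(a))) {
            out.writeInt(1);
        }
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(b))) {
            out.writeInt(2);
        }

        // Reads straight across the file boundary as if it were one stream
        try (DataInputStream in = openAll(Arrays.asList(a, b))) {
            System.out.println(in.readInt() + in.readInt()); // prints 3
        }
    }
}
```

    Note that SequenceInputStream only opens the next underlying stream when the previous one is exhausted, so the files are consumed one at a time rather than all at once.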

  2. 2016-09-09 17:09

    There is no laziness possible for the reading within a file, as you are reading the entire file to construct an instance of FileDataMetaData. You would need a substantial refactoring of that class to be able to construct an instance of FileDataMetaData without having to read the entire file.

    However, there are several things to clean up in that code, some of them specific to Java 7 rather than Java 8: you don’t need the RandomAccessFile detour to open a channel anymore, and there is try-with-resources to ensure proper closing. Note further that your usage of memory mapping makes no sense. When you copy the entire contents into a heap ByteBuffer after mapping the file, there is nothing lazy about it. It’s exactly the same as what happens when you call read with a heap ByteBuffer on a channel, except that the JRE can reuse buffers in the read case.

    In order to allow the system to manage the pages, you have to read from the mapped byte buffer. Depending on the system, this might still not be better than repeatedly reading small chunks into a heap byte buffer.

    public FileDataMetaData readRawFile(
        File file, int numDataPoints, int dataPacketSize) throws IOException {
    
        try(FileChannel fileChannel=FileChannel.open(file.toPath(), StandardOpenOption.READ)) {
            long fileSz = fileChannel.size();
            MappedByteBuffer bbRead=fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSz);
            List<DataPacket> dataPacketList = new ArrayList<>();
            while(bbRead.hasRemaining()) {
                int channelId = bbRead.getInt();
                long timestamp = bbRead.getLong();
                int[] data = new int[numDataPoints];
                for (int i=0; i<numDataPoints; i++) 
                    data[i] = bbRead.getInt();
                dataPacketList.add(new DataPacket(channelId, timestamp, data));
            }
            return new FileDataMetaData(file.getCanonicalPath(), fileSz, dataPacketList);
        } catch (IOException e) {
            logger.error("", e);
            throw e;
        }
    }
    

    Building a Stream based on this method is straightforward; only the checked exception has to be handled:

    public Stream<FileDataMetaData> readBinaryFiles(
        List<File> files, int numDataPoints, int dataPacketSize) throws IOException {
        return files.stream().map(f -> {
            try {
                return readRawFile(f, numDataPoints, dataPacketSize);
            } catch (IOException e) {
                logger.error("", e);
                throw new UncheckedIOException(e);
            }
        });
    }
    
  3. 2016-09-09 17:09

    Building on VGR's comment, I think his basic solution of:

    return files.stream().map(f -> readRawFile(f, numDataPoints, dataPacketSize))
    

    is correct, in that it will lazily process the files (and stop if a short-circuiting terminal operation is invoked on the result of the map() operation). I would also suggest a slightly different implementation of readRawFile that leverages try-with-resources and an InputStream, which will not load the whole file into memory:

    public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize)
      throws DataPacketReadException { // <- custom unchecked exception, nested in this class
    
      try (FileInputStream fileInput = new FileInputStream(file)) {
        String filePath = file.getCanonicalPath();
        long fileSize = fileInput.getChannel().size();
    
        DataInputStream dataInput = new DataInputStream(new BufferedInputStream(fileInput));
    
        return new FileDataMetaData(
          filePath,
          fileSize,
          dataPacketsFrom(dataInput, numDataPoints, dataPacketSize, filePath));
      }
      catch (IOException e) {
        throw new DataPacketReadException("Unable to read file: " + file, e);
      }
    }
    
    private List<DataPacket> dataPacketsFrom(DataInputStream dataInput, int numDataPoints, int dataPacketSize, String filePath)
        throws DataPacketReadException { 
    
      List<DataPacket> packets = new ArrayList<>();
      try {
        while (dataInput.available() > 0) {
          // Logic to assemble a DataPacket, e.g. mirroring the original readRawFile:
          int channelId = dataInput.readInt();
          long timestamp = dataInput.readLong();
          int[] data = new int[numDataPoints];
          for (int i = 0; i < numDataPoints; i++)
            data[i] = dataInput.readInt();
          packets.add(new DataPacket(channelId, timestamp, data));
        }
      }
      catch (EOFException e) {
        throw new DataPacketReadException("Unexpected EOF on file: " + filePath, e);
      }
      catch (IOException e) {
        throw new DataPacketReadException("Unexpected I/O exception on file: " + filePath, e);
      }
    
      return packets;
    }
    

    This should reduce the amount of code, and make sure that your files get closed on error.

  4. 2016-09-16 13:09

    This should be sufficient:

    return files.stream().map(f -> readRawFile(f, numDataPoints, dataPacketSize));
    

    …if, that is, you are willing to remove throws IOException from the readRawFile method’s signature. You could have that method catch IOException internally and wrap it in an UncheckedIOException. (The problem with deferred execution is that the exceptions also need to be deferred.)
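    To illustrate the unwrap on the consuming side, here is a minimal self-contained sketch (the int-keyed readOne stand-in replaces the real file reading; all names here are hypothetical): the checked exception is wrapped inside the lambda and surfaced again at the terminal operation, where the pipeline actually executes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeferredExceptionDemo {

    // Stand-in for readRawFile: fails on a marker value instead of a real file
    static String readOne(int id) throws IOException {
        if (id < 0)
            throw new IOException("bad file " + id);
        return "file-" + id;
    }

    // Wrap the checked exception so the lambda compiles; unwrap it
    // where the stream is consumed, so callers see IOException again.
    static List<String> readAll(Stream<Integer> ids) throws IOException {
        try {
            return ids.map(id -> {
                try {
                    return readOne(id);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }).collect(Collectors.toList());
        } catch (UncheckedIOException e) {
            throw e.getCause(); // the original IOException
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(Stream.of(1, 2, 3))); // prints [file-1, file-2, file-3]
        try {
            readAll(Stream.of(1, -1));
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage()); // prints caught: bad file -1
        }
    }
}
```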
