Read large MongoDB data

Question

I have a Java application that needs to read a large amount of data from MongoDB 3.2 and transfer it to Hadoop.

This batch application runs every 4 hours, 6 times a day.

Data Specifications:

  • Documents: 80,000 at a time (every 4 hours)
  • Size: 3 GB

Currently I am using MongoTemplate and Morphia to access MongoDB. However, I get an OOM exception when processing this data with the following:

List<MYClass> datalist = datasource.getCollection("mycollection").find().asList();

What is the best way to read this data and load it into Hadoop?

  • MongoTemplate::stream() and write to Hadoop one document at a time (see the sketch after this list)?
  • batchSize(someLimit) and write the entire batch to Hadoop?
  • Cursor.batch() and write to HDFS one document at a time?
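
For reference, a minimal sketch of what the first option could look like, assuming Spring Data MongoDB's MongoTemplate.stream() (a cursor-backed CloseableIterator in the Spring Data versions of that time); the nested MYClass stub and the writeToHadoop() helper are placeholders for illustration, not real APIs:

    import org.springframework.data.mongodb.core.MongoTemplate;
    import org.springframework.data.mongodb.core.query.Query;
    import org.springframework.data.util.CloseableIterator;

    public class MongoStreamOneByOne {

        // Placeholder standing in for the mapped entity class from the question.
        public static class MYClass { }

        public static void transfer(MongoTemplate mongoTemplate) {
            // stream() is backed by a server-side cursor, so documents are mapped
            // lazily instead of being loaded into one big in-memory list.
            try (CloseableIterator<MYClass> it = mongoTemplate.stream(new Query(), MYClass.class)) {
                while (it.hasNext()) {
                    writeToHadoop(it.next());   // hypothetical helper, one document per call
                }
            }
        }

        private static void writeToHadoop(MYClass document) {
            // write a single document to HDFS here (e.g. via Hadoop's FileSystem API)
        }
    }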

java | hadoop | mongodb | morphia   2017-09-28 11:09

Answers to Read large MongoDB data (1)

  1. 2017-09-28 11:09

    Your problem lies in the asList() call.

    This forces the driver to iterate through the entire cursor (80,000 docs, a few GB) and keep everything in memory at once.

    batchSize(someLimit) and Cursor.batch() won't help here, because you still end up holding the whole cursor's contents in memory, no matter what the batch size is.

    Instead you can:

    1) Iterate the cursor instead of collecting it into a list: drop the asList() call and loop over the cursor returned by datasource.getCollection("mycollection").find().

    2) Read documents one at a time and append them to a buffer (say, a list).

    3) Every 1,000 documents (say), call the Hadoop API to write the buffer out, clear the buffer, and continue, as in the sketch below.
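
    A minimal sketch of that chunked approach, assuming the plain MongoDB 3.x Java driver rather than Morphia; the host, database name, buffer size, and the writeToHadoop() helper are assumptions for illustration:

        import com.mongodb.MongoClient;
        import com.mongodb.client.MongoCollection;
        import com.mongodb.client.MongoCursor;
        import org.bson.Document;

        import java.util.ArrayList;
        import java.util.List;

        public class MongoToHadoopBatch {

            private static final int BUFFER_SIZE = 1000;   // flush threshold from step 3

            public static void main(String[] args) {
                MongoClient mongoClient = new MongoClient("localhost", 27017);   // assumed host/port
                try {
                    MongoCollection<Document> collection =
                            mongoClient.getDatabase("mydb")                      // assumed database name
                                       .getCollection("mycollection");

                    List<Document> buffer = new ArrayList<>(BUFFER_SIZE);

                    // Step 1: iterate the cursor instead of materializing the whole result set.
                    try (MongoCursor<Document> cursor = collection.find().iterator()) {
                        while (cursor.hasNext()) {
                            // Step 2: feed documents into a bounded buffer.
                            buffer.add(cursor.next());
                            // Step 3: flush the buffer to Hadoop every BUFFER_SIZE documents.
                            if (buffer.size() >= BUFFER_SIZE) {
                                writeToHadoop(buffer);
                                buffer.clear();
                            }
                        }
                    }
                    // Flush whatever is left once the cursor is exhausted.
                    if (!buffer.isEmpty()) {
                        writeToHadoop(buffer);
                    }
                } finally {
                    mongoClient.close();
                }
            }

            // Hypothetical helper: replace with the actual HDFS write (e.g. Hadoop's FileSystem API).
            private static void writeToHadoop(List<Document> batch) {
                // write the batch to HDFS here
            }
        }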
