Garbage when loading xml content with URLConnection

Question

I'm trying to load the content of an XML page using URLConnection, but I'm getting back garbage characters. The same code works for pretty much any other site, so I'm not sure what the issue is.

Here's the relevant code:

String urlString = "http://myUrl";
URL url = new URL(urlString);
URLConnection conn = url.openConnection();
conn.setConnectTimeout(60 * 1000); // wait at most 60 seconds to connect
conn.setReadTimeout(60 * 1000);    // wait at most 60 seconds for data
InputStreamReader isr = new InputStreamReader(conn.getInputStream(), encoding);
BufferedReader in = new BufferedReader(isr);
String inputLine;
while ((inputLine = in.readLine()) != null) {
    wholeDocument += inputLine;     
}       

Printing out wholeDocument produces a bunch of characters like this: er���;�pI.���$6

I am using encoding = "UTF-8".

I also tried using XML libraries, for example:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new URL(baseUrl).openStream());
System.out.println("doc = " + doc);

But the result is the same. When I use curl in a terminal (I'm on a Mac), the result is similar, although the characters look like this: ???0??KZV??????0N6?aH:$?X9v???$>???`

Any idea how to solve this?


| java   | xml   | web-crawler   2016-08-21 14:08 2 Answers

Answers to Garbage when loading xml content with URLConnection ( 2 )

  1. 2016-08-21 14:08

    If you check the headers of your response, you will see Content-Encoding: gzip, which indicates that the body of the response has been compressed. You need to decompress it first; that is why you get those weird characters. See HTTP compression for more details.

    A good way to check the headers with curl is to use the verbose option -v. In this case, running curl -v http://sites.one.co.il/XML/VOD/ | more let me quickly see the response headers.

  2. 2016-08-21 14:08

    Expanding on the other answer, you can check whether the response is gzip-encoded and decode it if so:

    InputStream body = conn.getInputStream();
    // getHeaderField returns null when the header is absent; the reversed comparison handles that
    if ("gzip".equalsIgnoreCase(conn.getHeaderField("Content-Encoding"))) {
        body = new GZIPInputStream(body); // decompress the gzip body on the fly
    }
    // Declared outside the if block so the reader stays in scope afterwards
    InputStreamReader isr = new InputStreamReader(body, encoding);
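
For anyone who wants to try the decoding step without hitting the site, here is a minimal, self-contained sketch. The class name GzipAwareReader and its helper methods are made up for illustration, UTF-8 is assumed as in the question, and the compressed bytes are generated in memory instead of being fetched over HTTP.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of any library.
class GzipAwareReader {

    // Wraps the raw stream in a GZIPInputStream only when the header value is gzip.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Reads the whole (possibly compressed) body as UTF-8 text, one '\n' per line.
    static String readBody(InputStream raw, String contentEncoding) throws IOException {
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(decode(raw, contentEncoding), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    // Demo helper: compresses text the way a gzip-enabled server would.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<root>hello</root>");
        // Stands in for conn.getInputStream() and the Content-Encoding header value.
        System.out.print(readBody(new ByteArrayInputStream(compressed), "gzip"));
    }
}
```

With a real connection, the call would presumably be readBody(conn.getInputStream(), conn.getContentEncoding()), since URLConnection.getContentEncoding() returns the value of the Content-Encoding header, or null when the header is missing.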
    

    Alternatively, you can ask the server not to send gzip-encoded data:

    conn.setRequestProperty("Accept-Encoding", "identity"); 
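
One detail worth spelling out is ordering: the header has to be set before the connection is actually opened. A small sketch of that, where the class name IdentityRequest and the example.com URL are placeholders of mine and not taken from the question:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class, for illustration only.
class IdentityRequest {

    // Creates a connection object (no network traffic yet) that asks for an uncompressed body.
    static URLConnection open(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        // Request headers must be set before connect() or getInputStream() is called;
        // afterwards setRequestProperty throws IllegalStateException.
        conn.setRequestProperty("Accept-Encoding", "identity");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; substitute the page you are actually loading.
        URLConnection conn = open("http://example.com/");
        System.out.println(conn.getRequestProperty("Accept-Encoding"));
    }
}
```

A server is not obliged to honor this header, so it is probably safest to keep the Content-Encoding check as a fallback even when sending it.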
    
