Google searching in Java

Question

This program reads a text file of search queries, queries Google with each one, and writes all of the result links to another file. It works for a few hundred queries, but then suddenly stops working and reports an error.

(I will edit this post and post what errors are being returned from which lines of my program soon).

Any ideas what might be happening?

import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.Scanner;

public class GoogleSearcher {
  public static void main(String [] args) throws Exception {
    Scanner in = new Scanner (System.in);
    System.out.println("Input list of queries to search:");
    String loc = in.nextLine();
    loc = loc.replace("\\", "");
    System.out.println("Where to write file?");
    String writeLoc = in.nextLine();
    writeLoc = writeLoc.replace("\\", ""); // strip shell escape backslashes, same as for loc
    FileInputStream fstream = new FileInputStream(loc);
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    String line;
    PrintWriter pw = new PrintWriter(new FileWriter(writeLoc + "Google Search Results.txt"));
    while ((line = br.readLine()) != null) {
      System.out.println("Searching: \"" + line + "\"");
      ArrayList<String> t = googleSearch(line);
      if (t != null){
        for (int a = 0; a < t.size(); a++){
          pw.write(t.get(a) + System.lineSeparator());
        }
      }
    }
    br.close();
    pw.close();
  }
  public static ArrayList<String> googleSearch(String search) throws Exception {
    try {
      // URL-encode the whole query so characters such as '&', '#' and '+' survive
      String query = "https://www.google.com/search?q=" + java.net.URLEncoder.encode(search, "UTF-8");
      String page = getSearchContent(query);
      ArrayList<String> links = parseLinks(page);
      return formatLinks(links);
    } catch (Exception e) { 
      e.printStackTrace();
      System.out.println("Error... Trying next search");
      return null;
    } 
  }
  public static ArrayList<String> formatLinks(ArrayList<String> a){
    ArrayList<String> formatted = new ArrayList<String>();
    for (int i = 0; i < a.size(); i++){
      String t = a.get(i);
      t = t.replace("%3F", "?");
      t = t.replace("%3D", "=");
      formatted.add(t);
    }
    return formatted;
  }
  public static String getString(InputStream is) {
    StringBuilder sb = new StringBuilder();
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line;
    try {
      while ((line = br.readLine()) != null) {
        sb.append(line);
      }
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      if (br != null) {
        try {
          br.close();
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
    return sb.toString();
  }
  public static String getSearchContent(String path) throws Exception {
    final String agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
    URL url = new URL(path);
    final URLConnection connection = url.openConnection();
    connection.setRequestProperty("User-Agent", agent);
    final InputStream stream = connection.getInputStream();
    return getString(stream);
  }
  public static ArrayList<String> parseLinks(final String html) throws Exception {
    ArrayList<String> result = new ArrayList<String>();
    String pattern1 = "<h3 class=\"r\"><a href=\"/url?q=";
    String pattern2 = "\">";
    Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2));
    Matcher m = p.matcher(html);
    while (m.find()) {
      String domainName = m.group(0).trim();
      // remove unwanted text
      domainName = domainName.substring(domainName.indexOf("/url?q=") + 7);
      domainName = domainName.substring(0, domainName.indexOf("&amp;"));
      result.add(domainName);
    }
    return result;
  }
}

| java   | search   2017-01-03 07:01 2 Answers

Answers ( 2 )

  1. 2017-01-03 07:01

    Okay, after running several rounds of your program, I got the following error.

    Error... Trying next search
    Searching: "autoradiograph"
    java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Daustria&q=EgTLe7ahGOKSrcMFIhkA8aeDSylzciRE9l0cz9fUg6u2MeGh-muxMgNyY24
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1876)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
        at application.GoogleSearcher.getSearchContent(GoogleSearcher.java:90)
        at application.GoogleSearcher.googleSearch(GoogleSearcher.java:45)
        at application.GoogleSearcher.main(GoogleSearcher.java:32)
    java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Dautoradiograph&q=EgTLe7ahGOKSrcMFIhkA8aeDS_cQehdQreptc4cInLKEPYpprweeMgNyY24
    

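    This happens because Google blocks automated searches. As the trace shows, after enough requests it redirects to its /sorry/ CAPTCHA page and answers with HTTP 503. The block protects their servers from abuse such as denial-of-service attacks.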

    [Image: Google CAPTCHA page]

    Google does not allow automated searches without permission; see their support page. Here's an extract from that page:

    Automated queries

    Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google. Sending automated queries consumes resources and includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of our Webmaster Guidelines and Terms of Service.
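
    The 503 status and the /sorry/ redirect seen in the trace can be checked explicitly instead of surfacing as an IOException from getInputStream(). The following is a minimal sketch, not a way around the block: BlockDetector and its methods are hypothetical names, and treating 429 as a block signal is an assumption (the trace only shows 503). The idea is to stop the batch once the block is detected rather than keep sending queries.

```java
import java.io.IOException;
import java.net.HttpURLConnection;

// Hypothetical helper, not part of the original program.
class BlockDetector {

    /**
     * Returns true when the response looks like Google's automated-traffic block:
     * an HTTP 503 (seen in the trace), a 429 (assumption), or a final URL that
     * points at the /sorry/ interstitial.
     */
    static boolean isBlocked(int status, String finalUrl) {
        if (status == 503 || status == 429) {
            return true;
        }
        return finalUrl != null && finalUrl.contains("/sorry/");
    }

    /** Reads the status and the post-redirect URL from an opened connection. */
    static boolean isBlocked(HttpURLConnection connection) throws IOException {
        int status = connection.getResponseCode();      // does not throw on 4xx/5xx
        String finalUrl = connection.getURL().toString(); // URL after any redirects
        return isBlocked(status, finalUrl);
    }

    public static void main(String[] args) {
        // Offline demonstration using the host and path from the stack trace.
        System.out.println(isBlocked(503, "https://ipv4.google.com/sorry/index"));
    }
}
```

    In getSearchContent, the connection would be cast to HttpURLConnection and passed to isBlocked before getInputStream() is called; if it returns true, the loop in main should end instead of trying the next search.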

  2. 2017-01-03 08:01

    That's by design. Whenever Google detects that automated software is fetching its results, it asks for human verification and shows a CAPTCHA.

    See this answer from support.google.com.

    "Unusual traffic from your computer network"

    You might see "Our systems have detected unusual traffic from your computer network" if it seems like a computer or phone on your network is sending automated traffic to Google.

    What Google considers automated traffic

    • Sending searches from a robot, computer program, automated service, or search scraper
    • Using software that sends searches to Google to see how a website or webpage ranks on Google

    What to do when you see this message

    The error page most likely shows a CAPTCHA (a squiggly word with a box below it). To continue using Google, type the squiggly word into the box. It's how we know you're a human, not a robot. After you type the CAPTCHA correctly, the message will go away and you can use Google again.


    If you want to use Google search on your website, you can use Google Custom Search, which was created for that purpose.

    See also: Add search to your site
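
    For a program like the one in the question, Custom Search also offers a JSON API that is called with an API key and a search engine ID. The sketch below only builds the request URL; CustomSearchUrl, build and the placeholder key and engine ID values are hypothetical, and quota and pricing are not covered here. The response is JSON, so parsing the result links would need a JSON library rather than the regex used in parseLinks.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original program.
class CustomSearchUrl {

    // Endpoint of the Custom Search JSON API.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /** Builds a request URL from an API key, a search engine ID (cx) and a query. */
    static String build(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    /** URL-encodes a value as UTF-8 (spaces become '+', '&' becomes %26). */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // YOUR_API_KEY and YOUR_ENGINE_ID are placeholders, not real credentials.
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID", "autoradiograph"));
    }
}
```

    The resulting URL can be fetched with the same URLConnection code as in getSearchContent, replacing the scraped search page with the API response.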
