url response from website while crawling

I want to set the response that is returned when my website URL is submitted to another website such as Facebook, so that the Facebook crawler collects that response. How can I set the response that I want to be displayed? Image showing an example of Facebook crawling like t...
more »
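The usual mechanism here is Open Graph meta tags: Facebook's crawler reads them from the page's <head> to build the link preview. A minimal sketch in Python, where the tag values are placeholders (not the asker's actual content) and the property names are the standard Open Graph ones:

```python
# Build the <head> fragment Facebook's crawler looks for. The og:* property
# names are standard Open Graph; the values here are placeholder assumptions.
from html import escape

def og_head(title, description, image_url, page_url):
    """Return an HTML fragment of Open Graph <meta> tags."""
    tags = {
        "og:title": title,
        "og:description": description,
        "og:image": image_url,
        "og:url": page_url,
    }
    return "\n".join(
        f'<meta property="{prop}" content="{escape(value, quote=True)}" />'
        for prop, value in tags.items()
    )

print(og_head("My Page", "What the crawler should display",
              "https://example.com/preview.png", "https://example.com/"))
```

Facebook's Sharing Debugger can then be used to check which tags its crawler actually picked up.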

2017-09-21 20:09 (0) Answers

how to deal with captcha when web scraping using R

I'm trying to scrape data from this website using httr and rvest. After several rounds of scraping (around 90-100), the website automatically redirects me to another URL with a captcha. This is the normal URL: "https://fs.lianjia.com/ershoufang...
more »
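The question uses R's httr/rvest, but the usual pattern is language-neutral: detect when the final URL after redirects is the captcha page, then back off rather than keep hammering the site. A sketch in Python, where the captcha marker substring and the delay schedule are assumptions:

```python
# Detect a captcha redirect and retry with exponential backoff.
# CAPTCHA_MARKER is an assumed substring of the redirect URL; fetch is
# injected (it should return the final URL after redirects).
import time

CAPTCHA_MARKER = "captcha"

def is_captcha(final_url):
    return CAPTCHA_MARKER in final_url.lower()

def fetch_with_backoff(fetch, url, max_tries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); if the site redirected to a captcha, wait and retry."""
    for attempt in range(max_tries):
        final_url = fetch(url)
        if not is_captcha(final_url):
            return final_url
        sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    raise RuntimeError("still seeing captcha after retries; solve it manually")
```

In practice, spacing requests out (and rotating sessions) often avoids triggering the captcha in the first place.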

2017-09-19 17:09 (1) Answers

Web Crawling with seed URLs from search engine

I need to know whether it is worth building a crawler on top of the results given by a search engine. That is, for a given query, grab N URLs from a search engine and feed them into a crawler to find more pages relevant to the search. Is there an...
more »
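Mechanically, a search-seeded crawl is just a breadth-first traversal starting from the N returned URLs. A sketch, with `link_extractor` as a stand-in for real fetching/parsing and the depth limit as an assumption:

```python
# Breadth-first crawl from a seed set. link_extractor(url) is an injected
# stand-in for "fetch the page and return its outgoing links".
from collections import deque

def crawl_from_seeds(seeds, link_extractor, max_depth=2):
    """Return all URLs reachable from the seeds within max_depth hops."""
    seen = set(seeds)
    frontier = deque((url, 0) for url in seeds)
    while frontier:
        url, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # don't expand past the depth limit
        for link in link_extractor(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return seen
```

Whether the extra pages stay relevant to the query is the real open question; relevance tends to decay quickly with hop distance from the seeds.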

2017-08-20 16:08 (1) Answers

Angular 4 Crawlability

I'm working on an Angular 4 landing page. Here is some info: URLs don't have a hashbang (example: www.something.com/about). Meta tag used in the head: <meta name="fragment" content="!">. We are using prerender with nginx proxying _escaped_fragment_ t...
more »
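Under the (now-deprecated) AJAX crawling scheme, a crawler that sees `<meta name="fragment" content="!">` re-requests the page with `?_escaped_fragment_=`, and the proxy is expected to serve a prerendered snapshot instead. A sketch of that routing decision in Python; the prerender service URL is a hypothetical placeholder, not the asker's nginx config:

```python
# Decide whether a request is the crawler's _escaped_fragment_ re-request
# and, if so, build the prerender URL for it. PRERENDER_BASE is assumed.
from urllib.parse import urlsplit, parse_qs

PRERENDER_BASE = "http://localhost:3000/render?url="  # hypothetical service

def route_request(url):
    """Return the prerender URL for crawler requests, else None."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    if "_escaped_fragment_" in query:
        clean = parts.scheme + "://" + parts.netloc + parts.path
        return PRERENDER_BASE + clean
    return None
```

Note that Google deprecated this scheme; rendering the page server-side (or letting Googlebot execute the JavaScript) is the current recommendation.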

2017-07-20 15:07 (0) Answers

How to follow external links with gem anemone?

I need help creating a method to follow external links with the anemone gem and check whether they are broken. I have tried other gems like link-checker and metainspector without success. Any suggestions would be great. Thank you. root = args[:...
more »
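The question is about Ruby's anemone gem, but as a language-neutral illustration the task reduces to: collect links whose host differs from the crawl root, request each one, and flag non-success statuses. A Python sketch with the fetch injected so no network is needed:

```python
# Separate external links from internal ones and flag those whose HTTP
# status signals breakage. fetch_status(url) -> int is injected.
from urllib.parse import urlsplit

def broken_external_links(root_url, links, fetch_status):
    """Return the external links that respond with a 4xx/5xx status."""
    root_host = urlsplit(root_url).netloc
    broken = []
    for link in links:
        if urlsplit(link).netloc == root_host:
            continue  # internal link: the crawler follows these itself
        if fetch_status(link) >= 400:
            broken.append(link)
    return broken
```

In anemone itself the equivalent hook would be a `page.links` callback that applies the same host comparison; a HEAD request is usually enough for the status check.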

2017-05-25 00:05 (0) Answers

Google crawler and dynamic websocket content

I have a big problem that I'm only considering now that I have finished development of a webpage. My page is written with only skeleton HTML. All the actual content of my site is sent through WebSockets; the client-side JavaScript then captures this...
more »

2017-05-20 13:05 (0) Answers

Can social networks run JavaScript when indexing?

For a few years now, Google's crawlers have been able to run JavaScript on SPA websites in order to index their content. But social networks (like Twitter, Facebook, et cetera) do not. Incidentally, I saw this website that uses AngularJS (version 1.x, so th...
more »

2017-04-23 15:04 (1) Answers

Python: ".txt" files cannot be created

".txt" files cannot be created. The code has been written, but the file is not created. I've been advised to use "pickle", but I don't know how to use it. How can I use this code to save the data as a file? Also, I would like to place the ...
more »
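Basic pickle usage is a short round trip: dump the object to a file opened in binary mode, then load it back. The filename and the crawled-data structure below are placeholders:

```python
# Serialize a Python object to disk with pickle and read it back.
# Note the "wb"/"rb" modes: pickle files are binary, not text.
import pickle

data = {"url": "http://example.com", "title": "Example"}  # stand-in for crawl results

with open("results.pkl", "wb") as f:
    pickle.dump(data, f)

with open("results.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == data
```

If the goal really is a plain ".txt" file, `open("results.txt", "w")` followed by writing strings is enough; pickle is only needed to round-trip Python objects intact.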

2017-04-22 04:04 (0) Answers

Replacing Google Site Search with AWS Cloudsearch

So I'm working on a site that has pretty specific global site search functionality that utilizes GSS, which, as many of you already know, is going away in April. I need to crawl the site and send XML over to CloudSearch, but I'm kind of confused as to...
more »
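CloudSearch ingests documents as a batch of add/delete operations (in JSON or XML). A sketch that builds a minimal XML batch with the standard library; the field names (`title`, `content`) are assumptions about the site's search schema, not CloudSearch requirements:

```python
# Build an Amazon CloudSearch-style XML document batch from crawled pages.
# Field names are placeholder assumptions about the indexing schema.
import xml.etree.ElementTree as ET

def build_batch(pages):
    """pages: iterable of (doc_id, title, content) tuples -> XML string."""
    batch = ET.Element("batch")
    for doc_id, title, content in pages:
        add = ET.SubElement(batch, "add", id=doc_id)
        ET.SubElement(add, "field", name="title").text = title
        ET.SubElement(add, "field", name="content").text = content
    return ET.tostring(batch, encoding="unicode")
```

The resulting string would then be uploaded through the CloudSearch document endpoint; the batch must match the index's configured fields.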

2017-03-02 21:03 (1) Answers

Combine fields into an array

I have code to crawl data: $doc = new DOMDocument(); $internalErrors = libxml_use_internal_errors(true); $doc->loadHTMLFile($url); // Restore error level libxml_use_internal_errors($internalErrors); $xpath = new DOMXpath($doc); $result=arra...
more »
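The question's PHP code pulls separate field lists with DOMXPath and needs them merged into one record per row; the core idea is to zip the parallel lists. A Python sketch of the same pattern with the stdlib HTML parser, where the class names ("name", "price") are assumptions about the scraped markup:

```python
# Collect parallel field lists from HTML, then zip them into one dict per
# row. The "name"/"price" class names are placeholder assumptions.
from html.parser import HTMLParser

class FieldCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None
        self.fields = {"name": [], "price": []}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.fields:
            self.current = cls  # remember which field the next text belongs to

    def handle_data(self, data):
        if self.current:
            self.fields[self.current].append(data.strip())
            self.current = None

def combine(html):
    parser = FieldCollector()
    parser.feed(html)
    return [{"name": n, "price": p}
            for n, p in zip(parser.fields["name"], parser.fields["price"])]
```

In the PHP original, the equivalent is iterating two DOMXPath node lists by index and appending one associative array per index to `$result`.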

2017-02-28 07:02 (2) Answers

What are the alternatives to Angular Universal?

I'm creating an Angular 2 app and I'm having problems with the Google crawler: the pages are not being indexed. Angular Universal promises to solve that, but it's not supported by some components I'm using, so I'm looking for an alternative. What are the...
more »

2017-02-25 17:02 (0) Answers

How to read a sitemap and its directories?

I am building a web crawler for this particular site: http://www.dictionary.com. After checking robots.txt: User-agent: * Disallow: /site= Disallow: /5480.iac. Disallow: /go/ Disallow: /audio.html/ Disallow: /houseads/ Disallow: /askhome/ Di...
more »
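Python's stdlib robot parser can answer "may I fetch this path?" directly from rules like the ones quoted, with no network access. A sketch using a subset of the listed Disallow lines:

```python
# Parse robots.txt rules from a string and query them per URL.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /site=
Disallow: /5480.iac.
Disallow: /go/
Disallow: /audio.html/
Disallow: /houseads/
Disallow: /askhome/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://www.dictionary.com/browse/crawl"))  # allowed
print(rp.can_fetch("*", "http://www.dictionary.com/go/anything"))   # disallowed
```

With a live crawler, `rp.set_url(".../robots.txt")` plus `rp.read()` fetches the rules instead, and each candidate URL is checked with `can_fetch` before requesting it.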

2017-02-18 17:02 (1) Answers