How to follow external links with the anemone gem?

I need help creating a method to follow external links with the anemone gem and check whether they are broken. I have tried other gems like link-checker and metainspector unsuccessfully. Please, any suggestions would be great. Thank you. root = args[:...

2017-05-25 00:05 (0) Answers
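The question is about Ruby's anemone gem, but the two decisions involved (is a link external, and is it broken?) are language-independent. Below is a minimal sketch of that logic in Python using only the standard library; `example.com` and the helper names are illustrative, not part of anemone's API:

```python
from urllib.parse import urlparse, urljoin

def is_external(link, root):
    """Return True if `link` points outside the host of `root`."""
    # Resolve relative links against the page being crawled first.
    absolute = urljoin(root, link)
    return urlparse(absolute).netloc != urlparse(root).netloc

def looks_broken(status_code):
    """Treat 4xx/5xx responses as broken links."""
    return status_code >= 400
```

In a crawl loop you would collect the hrefs of each page, keep the ones where `is_external` is true, issue a HEAD request for each, and flag the link when `looks_broken` returns true for the response status.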

Google crawler and dynamic websocket content

I have a big problem that I'm only considering now that I have finished development of a webpage. My page is written with only skeleton HTML. All the actual content of my site is sent through WebSockets. The client-side JavaScript then captures this...

2017-05-20 13:05 (0) Answers

Can social networks run JavaScript when indexing?

For a few years now, Google's crawlers have been able to run JavaScript on SPA websites in order to index their content. But social networks (like Twitter, Facebook, et cetera) do not. Incidentally I saw this website that uses AngularJS (version 1.x, so th...

2017-04-23 15:04 (1) Answers

Python: ".txt" files cannot be created

".txt" files cannot be created. The code has been written, but the file is not created. I've been advised to use "pickle", but I don't know how to use "pickle". How can I use this code to save it as a file? Also, I would like to place the ...

2017-04-22 04:04 (0) Answers
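Since the asker was pointed at pickle without an explanation of how to use it, here is the basic save/load pattern. The function names and the dictionary are illustrative; the key detail is the binary (`"wb"`/`"rb"`) file modes, which pickle requires:

```python
import pickle

def save_data(obj, path):
    # "wb" is required: pickle writes binary data, not text.
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_data(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

Opening the file in text mode (`"w"`) is a common cause of pickle errors, and forgetting to open the file at all is a common cause of "the file is not created".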

Replacing Google Site Search with AWS CloudSearch

So I'm working on a site that has pretty specific global site search functionality that utilizes GSS which, as many of you already know, is going away in April. I need to crawl the site and send XML over to CloudSearch, but I'm kind of confused as to...

2017-03-02 21:03 (1) Answers
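CloudSearch accepts document batches as XML (`<batch>` wrapping `<add>` elements with `<field>` children). A minimal sketch of building such a batch with Python's standard library follows; the field names (`title`, etc.) are placeholders and must match the index fields configured in your own search domain:

```python
import xml.etree.ElementTree as ET

def to_batch_xml(docs):
    """Build a CloudSearch-style batch:
    <batch><add id="..."><field name="...">...</field></add></batch>"""
    batch = ET.Element("batch")
    for doc in docs:
        add = ET.SubElement(batch, "add", id=doc["id"])
        for name, value in doc["fields"].items():
            field = ET.SubElement(add, "field", name=name)
            field.text = value
    return ET.tostring(batch, encoding="unicode")
```

The crawler's job is then just to produce one `{"id": ..., "fields": {...}}` record per page and upload the serialized batch to the domain's document endpoint.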

Combining fields into an array

I have code to crawl data: $doc = new DOMDocument(); $internalErrors = libxml_use_internal_errors(true); $doc->loadHTMLFile($url); // Restore error level libxml_use_internal_errors($internalErrors); $xpath = new DOMXpath($doc); $result=arra...

2017-02-28 07:02 (2) Answers
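The question's PHP code extracts each field with a separate XPath query, which yields parallel lists; the usual fix is to zip those lists into one record per row. A sketch of that combining step in Python (the field names `title` and `price` are assumptions for illustration):

```python
def combine_fields(titles, prices):
    """Combine parallel lists of scraped fields into one list of records.
    zip() stops at the shorter list, which keeps the rows aligned."""
    return [{"title": t, "price": p} for t, p in zip(titles, prices)]
```

If the lists come out with different lengths, the XPath expressions are matching different row sets and should be anchored to a common row node instead.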

What are the alternatives to Angular Universal?

I'm creating an Angular 2 app and I'm having problems with the Google crawler. The pages are not being indexed. Angular Universal promises to solve that, but it's not supported by some components I'm using, so I'm looking for an alternative. What are the...

2017-02-25 17:02 (0) Answers

How to read a sitemap and its directories?

I am building a web crawler for this particular site, and after checking robots.txt: User-agent: * Disallow: /site= Disallow: /5480.iac. Disallow: /go/ Disallow: /audio.html/ Disallow: /houseads/ Disallow: /askhome/ Di...

2017-02-18 17:02 (1) Answers
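Python ships a robots.txt evaluator in the standard library, which is handy for checking which URLs a crawler may fetch before touching the sitemap. A sketch using the rules quoted in the excerpt above (the site itself is not named in the excerpt, so `example.com` is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt above (the original list is truncated).
rules = """\
User-agent: *
Disallow: /site=
Disallow: /5480.iac.
Disallow: /go/
Disallow: /audio.html/
Disallow: /houseads/
Disallow: /askhome/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
allowed_home = rp.can_fetch("*", "http://example.com/")        # not disallowed
allowed_go = rp.can_fetch("*", "http://example.com/go/page")   # matches /go/
```

Each sitemap URL can be passed through `can_fetch` the same way, so the crawler only visits directories the rules permit.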

Classified ads website Google indexing

We have a large classified ads website. We have a good sitemap and robots.txt, we have good content, and Google crawls us 4 times per second. But... we don't get indexed in Google, and our new advertisement posts appear in the SERPs only after 2 or 3 days. What should...

2017-01-17 08:01 (0) Answers

Truncated pages getting indexed in Google

We have recently noticed that some of the pages of our website in Google's index are truncated. We initially thought this might be because of some sort of timeout being applied by the web server, or maybe an abrupt break in the socket connection. This...

2017-01-16 05:01 (0) Answers

Googlebot not respecting HTTP basic auth

I have basic auth set up and it has always worked. Suddenly Google started crawling my pages. The auth is still there (I have checked it using different browsers). I am at a loss as to how it's possible. The user/pass is dead simple to guess from the ur...

2017-01-03 22:01 (0) Answers

Disallow some image folders

I am making my robots.txt file, but I am a little unsure about how to write the disallow rules for Googlebot-Image. I want to allow the Google bot to crawl my site, except for the paths I have disallowed below. This is what I made: User-agent: Googlebot Disallo...

2016-12-30 18:12 (1) Answers
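A common pattern for this (the folder names below are placeholders, not taken from the question) is to give Googlebot-Image its own user-agent group. A crawler obeys the most specific group that matches its name, so the image bot follows its group while regular Googlebot follows the other:

```
User-agent: Googlebot-Image
Disallow: /images/private/
Disallow: /thumbnails/

User-agent: Googlebot
Disallow: /admin/
```

Note that once a specific group matches a bot, the generic rules are not merged in; each group must be complete on its own.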

H1 tags present but not showing on crawl

I've been working on a website lately, and it's suspected that the HTML is causing the H1 and H2 tags not to be recognised when crawled. The website is Just wondering if anybody can recognise the issues? ...

2016-12-21 23:12 (0) Answers
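When headings exist in the browser but not "on crawl", the usual causes are malformed markup or headings injected by JavaScript. A quick offline diagnostic is to run a plain HTML parse over the page source and see which headings it actually finds; a sketch using Python's standard library (the class and function names are illustrative):

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text of <h1>/<h2> tags as a simple crawler would see them."""
    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.headings.append((self._current, data.strip()))

def find_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    return parser.headings
```

If this finds nothing in the raw page source while the headings are visible in the browser, the headings are being added client-side rather than being broken markup.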

Web Crawler problems in Python

I've been working on creating a single-threaded web crawler in Python that will group the assets of each page and output a JSON array of the form: [ { url: '', assets: [ '', 'http://ur...

2016-12-12 21:12 (0) Answers
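The per-page part of the crawler described above can be sketched with the standard library alone: collect `img`/`script` `src` and `link` `href` attributes, and build one record per URL. The record shape follows the JSON form quoted in the question; the class name is an assumption:

```python
from html.parser import HTMLParser

class AssetCollector(HTMLParser):
    """Collect static assets (img/script src, link href) from one page."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.assets.append(attrs["src"])
        elif tag == "link" and "href" in attrs:
            self.assets.append(attrs["href"])

def page_record(url, html):
    collector = AssetCollector()
    collector.feed(html)
    return {"url": url, "assets": collector.assets}
```

Collect one `page_record` per crawled page and pass the list through `json.dumps` to get the JSON array the question describes.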

Can SharePoint crawl URLs which contain a hashtag (#)?

I have an external site in a SharePoint search content source. This is a single-page application site, and the page URLs contain # like this. SharePoint search cannot crawl these pages. How can I crawl these pages? I set to...

2016-12-06 08:12 (0) Answers
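The underlying issue is that everything after `#` is a URL fragment: browsers keep it client-side and never send it to the server, so a crawler requesting the URL only ever sees the part before the `#`. This is easy to demonstrate with Python's standard library (`example.com` and the path are placeholders):

```python
from urllib.parse import urlparse

# The fragment (after "#") stays in the browser; the server, and therefore
# most crawlers, only receive the path before it.
url = "http://example.com/app#/products/42"
parts = urlparse(url)
path_seen_by_server = parts.path   # "/app"
fragment = parts.fragment          # "/products/42"
```

This is why every `#`-routed view of a single-page application looks like the same page to an indexer; making the views crawlable requires serving them under real paths (or server-side rendering).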

Do bots/spiders clone public git repositories?

I host a few public repositories on GitHub which occasionally receive clones according to traffic graphs. While I'd like to believe that many people are finding my code and downloading it, the nature of the code in some of them makes me suspect that ...

2016-11-12 13:11 (1) Answers

Website backlink finder

I want to develop a tool for getting a website's backlinks. Can you suggest the best approach, or any API that helps with getting website backlinks? The backend language is C#. If you have any ideas, please share. Thanks ...

2016-09-27 15:09 (0) Answers

Multilingual website and bot detection

I have a website where I implement multilingual support. I divide my languages across subdomains. // root domain => neutral language for bots On the subdomains, if a language cookie was not set, I...

2016-09-22 18:09 (1) Answers
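Serving bots a neutral language requires detecting them first; the usual heuristic is to look for known crawler tokens in the `User-Agent` header. A minimal sketch (the token list is illustrative and deliberately incomplete; production code should also verify Googlebot via reverse DNS, since user agents are trivially spoofed):

```python
# Common crawler tokens; an illustrative, not exhaustive, list.
BOT_TOKENS = ("googlebot", "bingbot", "yandex", "baiduspider", "duckduckbot")

def is_bot(user_agent):
    """Heuristic check of the User-Agent header for known crawler tokens."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)
```

When `is_bot` is true, skip the cookie/redirect logic and serve the neutral-language content directly, so crawlers never get bounced between subdomains.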

SEO for dynamic page content?

I have written a website that is built like a content management system, so every page (including the home page) is built from articles and, as a result, every page consists of these articles. When a page is loading, all of these articles are loaded by an a...

2016-07-29 10:07 (0) Answers

Saved web pages when opened shows nothing

I'm using Python to crawl a webpage and save it, and the code works properly. But when I open the saved web page it just shows the website name, i.e., and not the actual content. You can just go to the website and save one of its page...

2016-07-06 10:07 (2) Answers
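This symptom (saved page shows only the site name) usually means the fetched HTML is just a JavaScript skeleton: the title is in the source, but the content is rendered client-side, which a plain HTTP fetch never executes. A quick way to confirm is to extract the visible text of the saved file and see how little there is; a sketch using the standard library (class and function names are illustrative):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Extract the visible text of a page, skipping <script>/<style> bodies."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

If `visible_text` over the saved file returns almost nothing, the fix is not in the saving code: the page must be rendered with a JavaScript-capable tool (e.g. a headless browser) before its HTML is saved.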