Angular 4 Crawlability

I'm working on an Angular 4 landing page. Here is some info: the URLs don't have a hashbang (example: www.something.com/about). Meta tag used in the head: <meta name="fragment" content="!">. We are using prerender with nginx proxying _escaped_fragment_ t...
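For reference, a minimal sketch of what that nginx side of the setup typically looks like, assuming a local prerender service (the 127.0.0.1:3000 upstream and the paths are placeholders, not the asker's actual config):

    # Route requests carrying the _escaped_fragment_ query parameter
    # to a prerender service; everything else falls through to the app.
    location / {
        if ($args ~ "_escaped_fragment_") {
            proxy_pass http://127.0.0.1:3000;
        }
        try_files $uri $uri/ /index.html;
    }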

2017-07-20 15:07 (0) Answers

How to follow external links with the anemone gem?

I need help creating a method that follows external links with the anemone gem and checks whether they are broken. I have tried other gems, like link-checker and metainspector, without success. Any suggestions would be great. Thank you. root = args[:...
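A minimal sketch of one approach: anemone's page.links appears to return only in-domain URLs, so external links are pulled from the parsed Nokogiri document (page.doc) instead and checked with Net::HTTP. The root URL is a placeholder:

    require 'anemone'
    require 'net/http'
    require 'uri'

    root = 'http://example.com/'   # placeholder root URL
    host = URI(root).host

    Anemone.crawl(root) do |anemone|
      anemone.on_every_page do |page|
        next unless page.doc
        # Pull every anchor from the parsed document, not just in-domain ones.
        page.doc.css('a[href]').each do |a|
          link = URI.join(page.url.to_s, a['href']) rescue next
          next unless link.is_a?(URI::HTTP) && link.host != host
          response = Net::HTTP.get_response(link) rescue next
          puts "BROKEN: #{link} (#{response.code})" if response.code.to_i >= 400
        end
      end
    end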

2017-05-25 00:05 (0) Answers

Google crawler and dynamic websocket content

I have a big problem that I'm only considering after finishing development of a webpage. My page is written with only skeleton HTML; all the actual content of my site is sent through WebSockets. The client-side JavaScript then captures this...

2017-05-20 13:05 (0) Answers

Can social networks run JavaScript when indexing?

For a few years now, Google's crawlers have been able to run JavaScript on SPA websites in order to index their content. But social networks (like Twitter, Facebook, et cetera) do not. Incidentally, I saw this website that uses AngularJS (version 1.x, so th...
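The usual workaround is to render Open Graph and Twitter Card meta tags server-side, so that crawlers which never execute JavaScript can still read them; a sketch with placeholder values:

    <!-- Served in the initial HTML, before any JavaScript runs. -->
    <meta property="og:title" content="Page title" />
    <meta property="og:description" content="Short description" />
    <meta property="og:image" content="https://example.com/preview.jpg" />
    <meta name="twitter:card" content="summary_large_image" />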

2017-04-23 15:04 (1) Answers

python_" .txt" files can not be created

" .txt " files can not be created. The code has been created, but the file is not created. I've been advised to use " pickle ". But I don't know how to use " pickle. " How can I use this code to save it as a file Also, I would like to place the ...

2017-04-22 04:04 (0) Answers

Replacing Google Site Search with AWS CloudSearch

So I'm working on a site that has pretty specific global site search functionality utilizing GSS, which, as many of you already know, is going away in April. I need to crawl the site and send XML over to CloudSearch, but I'm kind of confused as to...
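A minimal sketch of the upload side using boto3's cloudsearchdomain client, assuming the documents have already been extracted by a crawl; the endpoint URL, IDs, and fields are placeholders (CloudSearch accepts both JSON and XML batches):

    import json
    import boto3

    # The document endpoint comes from the CloudSearch domain console.
    client = boto3.client(
        'cloudsearchdomain',
        endpoint_url='https://doc-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com',
    )

    batch = [
        {'type': 'add', 'id': 'page-1',
         'fields': {'title': 'Example page', 'content': 'Crawled text...'}},
    ]

    client.upload_documents(
        documents=json.dumps(batch),
        contentType='application/json',
    )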

2017-03-02 21:03 (1) Answers

Combining fields into an array

I have some code to crawl data:

    $doc = new DOMDocument();
    $internalErrors = libxml_use_internal_errors(true);
    $doc->loadHTMLFile($url);
    // Restore error level
    libxml_use_internal_errors($internalErrors);
    $xpath = new DOMXpath($doc);
    $result=arra...
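One way to combine per-field XPath results into a single array, sketched with assumed class names ('title' and 'price' are placeholders for whatever fields the page actually has):

    $xpath = new DOMXpath($doc);
    $titles = $xpath->query('//*[@class="title"]');
    $prices = $xpath->query('//*[@class="price"]');

    $result = array();
    for ($i = 0; $i < $titles->length; $i++) {
        // Pair up the i-th title with the i-th price, if one exists.
        $price = $prices->item($i);
        $result[] = array(
            'title' => trim($titles->item($i)->textContent),
            'price' => $price ? trim($price->textContent) : null,
        );
    }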

2017-02-28 07:02 (2) Answers

What are the alternatives to Angular Universal?

I'm creating an Angular 2 app and I'm having problems with the Google crawler: the pages are not being indexed. Angular Universal promises to solve that, but it's not supported by some components I'm using, so I'm looking for an alternative. What are the...

2017-02-25 17:02 (0) Answers

How to read a sitemap and its directories?

I am building a web crawler for this particular site, http://www.dictionary.com, and after checking its robots.txt:

    User-agent: *
    Disallow: /site=
    Disallow: /5480.iac.
    Disallow: /go/
    Disallow: /audio.html/
    Disallow: /houseads/
    Disallow: /askhome/
    Di...
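Python's standard library can apply those rules directly; a minimal sketch (the path being tested is a placeholder):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('http://www.dictionary.com/robots.txt')
    rp.read()

    # True if a generic crawler ('*') may fetch this URL under those rules.
    print(rp.can_fetch('*', 'http://www.dictionary.com/browse/example'))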

2017-02-18 17:02 (1) Answers

Classified ads website Google indexing

We have a large classified ads website, with a good sitemap and robots.txt and good content. Google crawls us 4 times per second, but... we don't get indexed in Google, and our new advertisement posts appear in the SERP only after 2 or 3 days. What should...

2017-01-17 08:01 (0) Answers

Truncated pages getting indexed in Google

We have recently noticed that some of our website's pages in the Google index are truncated. We initially thought this might be because of some sort of timeout applied by the web server, or maybe an abrupt break in the socket connection. This...

2017-01-16 05:01 (0) Answers

Googlebot not respecting HTTP basic auth

I have basic auth set up, and it has always worked. Suddenly, Google started crawling my pages. The auth is still there (I have checked it using different browsers). I am at a loss as to how this is possible. The user/pass is dead simple to guess from the ur...

2017-01-03 22:01 (0) Answers

Disallow some image folders

I am writing my robots.txt file, but I am a little unsure about how to write a disallow rule for Googlebot-Image. I want to allow Googlebot to crawl my site, except for the disallows I have made below. This is what I made:

    User-agent: Googlebot
    Disallo...
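The pattern is to give Googlebot-Image its own user-agent group, since each crawler obeys the most specific group that names it; a sketch with placeholder paths:

    User-agent: Googlebot
    Disallow: /admin/

    User-agent: Googlebot-Image
    Disallow: /images/private/
    Disallow: /assets/photos/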

2016-12-30 18:12 (1) Answers

h1 tags present but not showing on crawl

I've been working on a website lately, and it's suspected that the HTML is causing the H1 and H2 tags not to be recognised when crawled. The website is http://houseoftravel.co.nz. Just wondering if anybody can spot the issues? ...

2016-12-21 23:12 (0) Answers

Web Crawler problems in Python

I've been working on creating a single-threaded web crawler in Python that will group the assets of each page and output a JSON array of the form:

    [
      {
        url: 'http://url.com/',
        assets: [
          'http://url.com/imgs/img1.jpg',
          'http://ur...
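A minimal single-page sketch of that output shape using only the standard library; the URL is the question's placeholder, and a real crawler would also queue same-domain links it finds:

    import json
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class AssetParser(HTMLParser):
        # Collects img/script src and link href values as absolute URLs.
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.assets = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ('img', 'script') and attrs.get('src'):
                self.assets.append(urljoin(self.base_url, attrs['src']))
            elif tag == 'link' and attrs.get('href'):
                self.assets.append(urljoin(self.base_url, attrs['href']))

    url = 'http://url.com/'  # placeholder from the question
    html = urlopen(url).read().decode('utf-8', errors='replace')
    parser = AssetParser(url)
    parser.feed(html)
    print(json.dumps([{'url': url, 'assets': parser.assets}], indent=2))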

2016-12-12 21:12 (0) Answers

Can SharePoint crawl URLs which contain a hash (#)?

I have an external site in a SharePoint search content source. This is a single-page application site, and the page URLs contain a #, like this: http://yoursite.com/content#/... SharePoint search cannot crawl these pages. How can I crawl them? I set to...
more »

2016-12-06 08:12 (0) Answers