Allowing Search Engines to Crawl A Site

Previous Entry

In the previous post, we discussed the scale of the internet, the three Vs, the role of search engines and why SEO is so important in the modern world.

If you haven’t read that post, you can do so here: The scale of the internet.

In this post, we will explore how search engines can understand the content of a website and the steps to make this easier for them.

How does a search engine crawl a website?

Major search engines such as Google and Bing have their own implementations of crawlers (also referred to as spiders or bots). These robots visit the pages of a website, download a copy of the markup and process it.

A crawler will process several elements from a website. These include, but are not limited to:

  • The title of the page.
  • The description of the page.
  • Any links and anchors on a page, both internal and external.
  • The content of headings.
  • The content of the main text.
  • Images and alternative text provided for those images.
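
To make the list above more concrete, here is a minimal sketch of how those elements could be pulled out of a page's markup using Python's built-in html.parser module. It is not how any real search engine works; the PageSummary class and the sample markup are made up for illustration, and the main text is skipped to keep the example short.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Hypothetical helper: collects a few crawler-relevant elements."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.links = []       # href values of <a> tags
        self.headings = []    # (tag, text) pairs
        self.images = []      # (src, alt) pairs
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = tag
        elif tag in self.HEADINGS:
            self._current = tag
            self.headings.append((tag, ""))
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img":
            self.images.append((attrs.get("src") or "", attrs.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in self.HEADINGS:
            tag, text = self.headings[-1]
            self.headings[-1] = (tag, text + data)


# Made-up sample page, for illustration only.
sample = """<html><head><title>Example page</title>
<meta name="description" content="A short summary."></head>
<body><h1>Welcome</h1><p>Read <a href="/about">about us</a>.</p>
<img src="/logo.png" alt="Site logo"></body></html>"""

summary = PageSummary()
summary.feed(sample)
print(summary.title)        # Example page
print(summary.description)  # A short summary.
print(summary.links)        # ['/about']
print(summary.headings)     # [('h1', 'Welcome')]
print(summary.images)       # [('/logo.png', 'Site logo')]
```

A real crawler does far more than this (rendering, deduplication, link scheduling), but the sketch shows why a clear title, description, headings and alternative text give a bot more to work with.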

Each bot has its own design, which is largely hidden from the public, and may therefore favour some properties of the markup over others.

Depending on other factors that will be covered in another post, the spider will revisit these pages periodically to check for updates.

Checking if a web page is already indexed

Many popular search engines offer tools to check what information they have about a website and allow you to provide extra information to them.

One example of this is Google Search Console, which reveals which URLs (and content) Google is aware of, which ones it has ignored, how many impressions those URLs have received and much more. Bing has a similar feature called Bing Webmaster Tools.

Google search console

It is also possible to use advanced operators such as “site:” to see if a search engine returns results for a specific site. For example, if we wanted to see which search results appear for “edwinlangley.co.uk”, we could search for “site:edwinlangley.co.uk”. This will also work for image results.

Google site results

Using “site:” in the same manner also works for Bing:

Bing site results
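
If you check several sites regularly, the “site:” searches shown above can be turned into links with a few lines of Python. This is only a sketch: the site_query_url helper is a made-up name, and it simply builds a URL to open in a browser rather than querying the search engine automatically.

```python
from urllib.parse import urlencode


def site_query_url(search_url, domain):
    """Build a search URL for the query 'site:<domain>' (hypothetical helper)."""
    return search_url + "?" + urlencode({"q": "site:" + domain})


google_url = site_query_url("https://www.google.com/search", "edwinlangley.co.uk")
bing_url = site_query_url("https://www.bing.com/search", "edwinlangley.co.uk")

print(google_url)  # https://www.google.com/search?q=site%3Aedwinlangley.co.uk
print(bing_url)    # https://www.bing.com/search?q=site%3Aedwinlangley.co.uk
```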

Robots.txt

Before a search engine robot crawls your site, it first looks for a robots.txt file at the root of the site. This file is important because it tells robots which parts of the website they may crawl and how they should go about it.
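
Because the file always lives at the root, its address can be worked out from any page URL on the same site. The sketch below shows one way to do that with Python's standard library; robots_txt_url is a made-up helper name and yourdomain.com is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt location for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yourdomain.com/blog/post?id=1"))
# https://yourdomain.com/robots.txt
```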

A robots.txt file can contain several directives, originally outlined by the Robots Exclusion Standard. These include the target user agent, the paths disallowed for that user agent, the allowed paths, a crawl delay for the robot (a non-standard directive that is usually interpreted in seconds, not milliseconds, and is ignored by some crawlers, including Google's) and the location of the sitemap.xml, which will be covered later.
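
To see how these directives fit together, here is a small sketch that feeds a made-up robots.txt to Python's built-in urllib.robotparser module and asks it a few questions. The paths, the delay value and the domain are assumptions chosen for illustration, and real crawlers may interpret the same file slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://yourdomain.com"
print(parser.can_fetch("*", base + "/admin/settings"))     # False
print(parser.can_fetch("*", base + "/admin/public/help"))  # True
print(parser.can_fetch("*", base + "/blog/"))              # True
print(parser.crawl_delay("*"))                             # 10
print(parser.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```

The Allow line is placed before the broader Disallow line on purpose: Python's parser applies the first rule that matches, so this ordering gives the same answer whether a crawler uses first-match or most-specific-match logic.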

An example of a robots.txt that would allow all user agents to crawl the entire website would be:


User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml

                

The above file uses the wildcard “*” to address all crawlers and tells them that nothing is disallowed.
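
One way to confirm that an empty Disallow line really does permit everything is to run the file through Python's built-in urllib.robotparser module. This is a quick check under the assumption that the crawler follows the same rules as this parser; the user agent names and the page path are only examples.

```python
from urllib.robotparser import RobotFileParser

allow_all = RobotFileParser()
allow_all.parse([
    "User-agent: *",
    "Disallow:",
    "",
    "Sitemap: https://yourdomain.com/sitemap.xml",
])

# Every user agent should be allowed to fetch every path.
results = [
    allow_all.can_fetch(agent, "https://yourdomain.com/any/page")
    for agent in ("Googlebot", "bingbot", "*")
]
print(results)  # [True, True, True]
```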