The internet consists of a massive number of websites and continues to grow at a phenomenal rate. It can be overwhelming for business owners to even begin thinking about getting their pages indexed, let alone noticed, in such a crowded web universe. So what are website owners to do?
The answer is to focus on ranking in search engines by building good SEO, with the help of robots.txt and an XML-based Sitemap. If we’ve lost you already, don’t worry: we’ll break everything down in beginner terms below.

What is a Sitemap.xml?
A Sitemap.xml is not the same thing as what you probably think of when you hear the word “sitemap.” A Sitemap file (with a capital S) is an XML-encoded listing of the key content files within a website, built specifically for search engine crawlers to consume as data. This stands in contrast to a traditional sitemap file (lower-case s), an HTML page that lists a site’s content so human visitors can find what they’re looking for. The key differences are the intended audience (search engine vs. human) and the code.
Search engines use Sitemap.xml files to learn about a website and its structure. If a Sitemap uses well-formed XML, with clean and valid URLs, and meets the other requirements of search engines, the URLs it contains will be considered by those search engines for future crawling activity.
Really, all you need to know is this simple formula: clean Sitemap.xml = happy robots = website indexed in search results
While well-executed Sitemaps are always good for websites to have, they are especially essential for the following websites:
- New websites
- Sites that use dynamic URLs for their content pages
- Sites with unorganized archived content
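To make this concrete, here is a minimal sketch of what a Sitemap.xml file might look like, following the standard sitemaps.org format. The domain and dates are hypothetical placeholders; your own file would list your actual page URLs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per key page on the site -->
  <url>
    <loc>https://www.example.com/</loc>
    <!-- Optional hints for crawlers -->
    <lastmod>2021-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/about</loc>
    <lastmod>2021-01-10</lastmod>
  </url>
</urlset>
```

Only the `<loc>` element is required for each entry; `<lastmod>`, `<changefreq>`, and `<priority>` are optional hints that crawlers may consider.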
What is robots.txt?
Now onto robots.txt. The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct robots (typically search engine robots) how to crawl their website pages. A robots.txt file is publicly available, meaning that anyone can see which sections a webmaster has blocked from search engines. Essentially, robots.txt tells Googlebot and other crawlers what is and is not allowed to be crawled, while the noindex tag tells Google Search what is and is not allowed to be indexed and displayed in search results.
The Basics of robots.txt
- Robots meta tags with parameters such as “noindex, follow” are used to restrict indexing while still allowing links to be followed
- However, malicious crawlers tend to ignore robots.txt, so the above protocol is not a reliable security measure
- Only one “Disallow:” line is permitted for each URL
- The filename of robots.txt is case sensitive. Make sure you use “robots.txt”, not “Robots.txt.”
- Spacing is not an accepted way to separate query parameters. For example, “/category/ /product page” would not be honored by robots.txt.
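Putting the basics above together, a simple robots.txt file might look like the following sketch. The blocked paths and domain here are hypothetical examples, not recommendations for any particular site.

```
# Rules below apply to all crawlers
User-agent: *

# One "Disallow:" line per path to block from crawling
Disallow: /admin/
Disallow: /tmp/

# Tell crawlers where to find the Sitemap
Sitemap: https://www.example.com/sitemap.xml
```

Note that the “noindex, follow” directive mentioned above does not go in robots.txt; it is set per page in the HTML, e.g. `<meta name="robots" content="noindex, follow">`.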
Why robots.txt and Sitemap.xml Matter
Though hopefully you have a skilled webmaster, it is important for website owners to understand the basics of robots.txt and Sitemaps to ensure their website is SEO-friendly and welcoming to humans and search engines alike. After all, search engines devour well-organized, indexed web content. The proper use of robots.txt and Sitemaps ensures your website works as well as it looks.
So now that you have a better understanding of Sitemaps and robots.txt, you may be motivated to get your website in shape. But how? You can either hire a webmaster to do this for you, or you can use a robots.txt and Sitemap.xml management system to help you along the way. These management systems not only indicate the type of sitemap (e.g., RSS feed, Atom feed, text file) but also give you information about the types of content on your site and the number of URLs that have been indexed by search engines. These systems will also notify you of errors relating to each sitemap you have previously submitted.