What is robots.txt: the basics for newbies
Successful indexing of a new site depends on many factors. One of them is the robots.txt file, which every novice webmaster should be familiar with. Updated material for beginners.
Learn more about the rules for composing the file in the complete guide “How to compose robots.txt yourself”. This material covers the basics for beginners who want to keep up with professional terminology.
What is robots.txt
A robots.txt file is a .txt document containing indexing instructions for search bots. It tells search engines which pages of a web resource may be indexed and which may not.
When a search robot arrives at your site, the first thing it does is look for robots.txt. If the robot does not find the file, or the file is composed incorrectly, the bot will explore the site at its own discretion, and it will not necessarily start with the pages that should enter the search results first (new articles, reviews, photo reports, and so on). Indexing a new site can take a long time, so the webmaster should take care of creating a correct robots.txt file in good time.
On some website builders the file is generated automatically. For example, Wix creates robots.txt on its own. To view the file, add “/robots.txt” to the domain. If you see strange elements like “noflashhtml” and “backhtml” there, don’t be alarmed: they relate to the structure of sites on the platform and do not affect how search engines treat the site.
Why robots.txt is needed
It would seem: why prohibit the indexing of some site content? Not all of the content that makes up a site is needed by search robots: there are system files, duplicate pages, keyword headings and much else that does not have to be indexed. One caveat: the contents of the robots.txt file are guidelines for bots, not hard and fast rules. Bots can ignore these recommendations.
Google warns that robots.txt cannot keep pages out of Google’s search results. Even if you block access to a page in robots.txt, it can still end up in the index if some other page links to it. It is better to combine robots.txt restrictions with other blocking methods:
Prohibition of site indexing, Yandex
Blocking indexing, Google
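One such additional method is the robots meta tag placed in a page’s HTML head. The sketch below is illustrative, not taken from the article:

```html
<!-- Placed inside the <head> of a page that must not appear in search results -->
<meta name="robots" content="noindex">
```

Note that for the crawler to see this tag, the page must not be blocked in robots.txt: if the bot is forbidden from fetching the page, it never reads the tag.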
However, without robots.txt it is more likely that information that should stay hidden will end up in the search results, which can lead to the disclosure of personal data and other problems.
What robots.txt consists of
The file must be named exactly “robots.txt”, in lowercase letters, and nothing else. It is placed in the root directory – https://site.com/robots.txt – in a single copy. In response to a request, it should return an HTTP 200 OK status. The file size must not exceed 32 KB: this is the maximum Yandex will accept, while for Google robots it can weigh up to 500 KB.
Everything inside must be in Latin characters; any Cyrillic names must be converted using a Punycode converter. Each URL prefix must be written on a separate line.
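As an illustration (the domain below is a made-up example, not from the article), Python’s built-in “idna” codec performs this kind of Punycode conversion:

```python
# A minimal sketch: converting a Cyrillic domain to its ASCII (Punycode)
# form with Python's standard "idna" codec, so it can be used in robots.txt.
cyrillic_domain = "сайт.рф"  # hypothetical domain for illustration

punycode = cyrillic_domain.encode("idna").decode("ascii")
print(punycode)  # an ASCII "xn--..." form of the same domain
```

The conversion is reversible: decoding the “xn--…” form with the same codec returns the original Cyrillic name.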
In robots.txt directives (commands or instructions) are written using special terms. Briefly about directives for search bots:
“User-agent:” – the main directive of robots.txt
Used to specify the search robot that will receive the instructions. For example, User-agent: Googlebot or User-agent: Yandex.
In the robots.txt file you can also address all search engines at once. The command in this case looks like: User-agent: *. The special character “*” is understood as “any text”.
After the main directive “User-agent:” specific commands follow.
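Directives combine into groups, one group per bot. A minimal illustrative file (the paths here are assumptions, not from the article) might look like this:

```text
User-agent: Yandex
Disallow: /admin

User-agent: *
Disallow: /tmp
```

Each group applies to the bot named in its “User-agent:” line; bots that are not named explicitly fall back to the “*” group.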
“Disallow:” command – prohibit indexing in robots.txt
Using this command, a search robot can be prohibited from indexing a web resource, either in whole or in part. It all depends on the path that follows the command.
User-agent: Yandex
Disallow: /
This kind of entry in the robots.txt file means that the Yandex search robot is not allowed to index the site at all: the prohibition sign “/” is not followed by any clarifying path.
Disallow: /wp-admin
This time there is a clarification, and it concerns the wp-admin system folder of the WordPress CMS. That is, the robot is advised not to index this folder at all.
Command “Allow:” – allow indexing in robots.txt
The antipode of the previous directive. Using the same qualifying elements, this command in the robots.txt file allows the crawler to add the site elements you need to the search base.
User-agent: *
Disallow: /
Allow: /catalog
Everything that starts with “/catalog” is allowed to be crawled, and everything else is prohibited.
In practice, “Allow:” is not used very often; there is no need for it because it is applied automatically. In robots.txt, “everything that is not prohibited is allowed.” The site owner only needs to use the “Disallow:” directive to prohibit indexing of some content; the search robot treats all other content of the resource as available for indexing.
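One case where “Allow:” is genuinely useful is opening a single path inside an otherwise blocked folder. The file and path below are a hypothetical illustration:

```text
User-agent: *
Disallow: /wp-admin
Allow: /wp-admin/admin-ajax.php
```

When Allow and Disallow rules conflict, both Yandex and Google give priority to the more specific (longer) rule, so here the folder is blocked while admin-ajax.php stays crawlable.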
Directive “Sitemap:” – an indication of the sitemap
“Sitemap:” tells the crawling robot the path to the sitemap – the sitemap.xml file, or sitemap.xml.gz in the case of the WordPress CMS.
Writing this command in the robots.txt file helps the search robot index the sitemap faster, which speeds up getting the resource’s pages into the search results.
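In the file, the directive takes a full URL. A hedged sketch (the domain is a placeholder):

```text
User-agent: *
Disallow: /wp-admin

Sitemap: https://site.com/sitemap.xml
```

The “Sitemap:” line is not tied to any particular “User-agent:” group and may appear anywhere in the file.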
File ready – what’s next
So you’ve created a robots.txt text document tailored to your site’s needs. This can also be done automatically, for example, using our tool.
What to do next:
- check that the document is composed correctly, for example, using the Yandex service;
- using an FTP client, upload the finished file to the root folder of your site. With WordPress, this is usually the public_html folder.
Then all that remains is to wait for the search robots to arrive, examine your robots.txt, and start indexing your site.
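Before uploading, you can also sanity-check your rules locally. The sketch below uses Python’s standard urllib.robotparser module, which implements the same matching logic crawlers use; the rules and paths are illustrative, not from the article:

```python
# A minimal sketch: checking robots.txt rules locally with Python's
# standard urllib.robotparser module. Rules and paths are hypothetical.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /wp-admin
Allow: /catalog
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a given bot may fetch a given path
print(parser.can_fetch("*", "/wp-admin/settings.php"))  # False
print(parser.can_fetch("*", "/catalog/item-1"))         # True
```

This catches obvious mistakes (a stray space in a path, a rule that blocks more than intended) before the file ever reaches the server.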
How to view the robots.txt of someone else’s site
If you want to look at ready-made examples of the file on other sites first, nothing could be easier. Just enter seonewsjournal.com/robots.txt in the browser’s address bar, replacing “seonewsjournal.com” with the name of the resource you are interested in.