Yes, Content Auditor can scrape content from a website.
You can download your content in CSV format by clicking the "download" icon on your sites list, then selecting the "content" option. For large sites it will take a while for Content Auditor to compile the content for download, so please be patient.
Currently this "download content" feature has some shortcomings:
- whitespace is stripped
- text formatting is stripped
- unwanted HTML elements in content
- Excel limitations
Whitespace is stripped
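While spaces within sentences remain intact, spaces between structural elements do not. This can cause issues where words run together without a space to separate them, for example when the last word of a heading is joined to the first word of the paragraph that follows it.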
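To illustrate the problem, here is a minimal sketch in Python using only the standard library. It is not Content Auditor's actual code; the `TextExtractor` class, the `extract_text` function, and the sample HTML are all hypothetical. It shows how joining text from adjacent elements with no separator glues words together, and how a separator avoids that.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of an HTML fragment (hypothetical helper)."""

    def __init__(self, separator=""):
        super().__init__()
        self.separator = separator
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty runs of text, trimmed at both ends.
        text = data.strip()
        if text:
            self.chunks.append(text)

    def text(self):
        return self.separator.join(self.chunks)


def extract_text(html, separator=""):
    """Return the text content of `html`, joined with `separator`."""
    parser = TextExtractor(separator)
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<h2>Our services</h2><p>We build websites.</p>"

print(extract_text(sample))       # Our servicesWe build websites.
print(extract_text(sample, " "))  # Our services We build websites.
```

Note that this sketch puts a separator between every run of text, so it would also add a space where an inline element such as `<b>` sits in the middle of a word. A more careful version would add separators only around block-level elements.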
Text formatting is stripped
The CSV format we're using for data downloads doesn't support the use of rich text formatting. This means that we lose any bold, italic, or other styles that help separate content into chunks. We also lose other structural formatting such as headings, paragraphs, and lists.
Unwanted HTML elements in content
Stripping unwanted HTML elements from your scraped content is a good thing. However, our current implementation contains some errors. For example, Content Auditor currently leaves in <!-- HTML comments --> and some less common HTML elements such as <video>.
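If you want to clean up leftover markup yourself, something along these lines may help. This is a rough sketch using Python's standard `html.parser` module, not a description of how Content Auditor works. The `CleanTextExtractor` class, the `strip_unwanted` function, and the set of skipped tags are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Elements whose contents should not appear in the output (assumed list).
SKIPPED_TAGS = {"video", "audio", "script", "style"}


class CleanTextExtractor(HTMLParser):
    """Collects text while ignoring comments and skipped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

    # handle_comment is deliberately not overridden: the base class
    # ignores comments, so they never reach the output.


def strip_unwanted(html):
    """Return the text of `html` without comments or skipped elements."""
    parser = CleanTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<p>Hello</p><!-- internal note -->"
    "<video controls>Fallback text</video><p>World</p>"
)
print(strip_unwanted(sample))  # Hello World
```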
Excel limitations
This is only an issue if you use Microsoft Excel to view the CSV and you have page content that's longer than 32,767 characters, which is the most text Excel allows in a single cell. If your content is longer than that, your data will "get weird". When a cell contains too many characters, Excel tries to place the extra characters into a new cell. New cells created this way aren't formatted properly. The result is a messy spreadsheet with random inconsistencies in formatting. The simplest solution to this issue is to use a different spreadsheet program. My personal preference is OpenOffice, which is entirely free.
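If you're not sure whether your download is affected, you can check it before opening it in Excel. The sketch below scans CSV rows and reports any cell longer than Excel's documented limit of 32,767 characters per cell. The `find_oversized_cells` function and the sample column names ("url", "content") are hypothetical; the real download may use different columns.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32_767  # Excel's documented maximum characters per cell

# Python's csv module rejects fields over 131,072 characters by default,
# so raise that limit before reading very long page content.
csv.field_size_limit(2**31 - 1)


def find_oversized_cells(rows, limit=EXCEL_CELL_LIMIT):
    """Yield (row_number, column_number, length) for each cell over `limit`.

    Row and column numbers start at 1, matching spreadsheet conventions.
    """
    for row_number, row in enumerate(rows, start=1):
        for column_number, cell in enumerate(row, start=1):
            if len(cell) > limit:
                yield row_number, column_number, len(cell)


# Build a small in-memory CSV standing in for a real download.
buffer = io.StringIO()
csv.writer(buffer).writerows([
    ["url", "content"],          # hypothetical header row
    ["/about", "short text"],
    ["/blog", "x" * 40_000],     # longer than Excel allows
])
buffer.seek(0)

oversized = list(find_oversized_cells(csv.reader(buffer)))
print(oversized)  # [(3, 2, 40000)]
```

To check a real file, open it with `open(path, newline="", encoding="utf-8")` and pass `csv.reader(f)` to `find_oversized_cells`. The encoding is an assumption, so adjust it if your download uses a different one.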
What we're planning to do
Our vision is to provide content scraped from your site in a useful format. We've started experimenting with moving from the .CSV format to .ODS, which would allow us to use rich text formatting. We also plan to improve our treatment of whitespace and HTML elements.