Post by Anthony DeMarco, Junior Copywriter
Google’s algorithm derives meaning from webpages by reading data points in the HTML structure, such as META data, on-page text, and hyperlink anchor text. However, a September 2011 post on the Google Webmaster Central Blog confirms that Google also indexes PDF files. The post discloses several important facts:
- Text-based PDFs (as opposed to scanned images of text) are the preferable format
- If a PDF's text is embedded as images, Google will use OCR algorithms to convert the images into text
- Hyperlinks in PDFs are generally treated the same as hyperlinks in HTML
- PDFs cannot be crawled if the file is password protected or encrypted
- Google extracts META data contained within the PDF to inform its indexing
Given that Google has been indexing PDF files since 2001 and can extract large amounts of data from them, it is worth considering the pros and cons of utilizing PDFs on a website. The first four points below weigh in favor of keeping content in PDF format; the remaining points weigh against it:
- Easy to Create: PDFs are easy to create, even for individuals unfamiliar with HTML.
- Long Tail Keywords: According to a 2013 Search Engine Journal article, in narrow and less competitive search environments, organic long tail keywords within the copy of the PDF can propel it to high rankings on the Search Engine Results Page (SERP).
- Time and Associated Cost: Reconstructing a PDF's content as HTML, then optimizing and integrating it into the site, is a labor-intensive process. This should be taken into consideration when a large number of files would need to be converted.
- Larger SEO Task: Converting PDF files to HTML can create duplicate content on a website, which then needs to be addressed with SEO measures such as canonical links.
- Lack of Integration: When a visitor is reading a PDF, they are not engaging with the site. This can affect a number of metrics, including conversion rates. Additionally, PDFs lack easy navigation back to the website, and launching a PDF may disable the back button in the browser.
- Specificity: Larger PDF files often cover several topics, and the nature of the format prevents the file from being broken up. This can hinder optimizations that rely on specificity.
- Loss of Structured Data: Structured data cannot be applied to PDF files, so the PDF loses the opportunity to generate Rich Snippets and the accompanying SEO benefits.
- Lack of Control: Content Management Systems often treat PDF files as downloads rather than pages, making it more difficult to organize and control a website.
- Loss of Audience: If the visitor does not have an up-to-date PDF viewer, they may not commit to an extra download to access client content.
- Potential Errors: If the PDF's text is stored as images rather than as text, the OCR-extracted data may contain errors, because OCR technology is not 100% accurate. This can damage the SEO value of organic keywords and hyperlinks.
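Where a PDF and its HTML conversion both stay online, the canonical link mentioned above cannot live inside the PDF, because a PDF has no HTML head section. One common approach is to send it as an HTTP Link header with the PDF response. The following is a minimal sketch of building that header; the function name and the example URL are hypothetical, and how the header is attached depends on the web server in use.

```python
def canonical_link_header(html_url):
    """Build an HTTP header (name, value) naming html_url as the canonical version.

    Intended to be sent with the PDF response so that the HTML page,
    not the PDF, is treated as the preferred copy of the content.
    """
    # A canonical URL should be absolute; reject relative paths early.
    if not html_url.startswith(("http://", "https://")):
        raise ValueError("canonical URL must be absolute")
    return ("Link", '<{}>; rel="canonical"'.format(html_url))


# Hypothetical HTML page that replaces a PDF white paper.
name, value = canonical_link_header("https://www.example.com/guides/whitepaper.html")
print("{}: {}".format(name, value))
```

The header points from the PDF to the HTML page, on the assumption that the HTML version is the one intended to rank.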
For SEO purposes, the advantages of converting PDFs to HTML outweigh the disadvantages. Additionally, content in HTML format is more malleable: updating and optimizing an HTML page is easier than recreating an entire PDF.
The major deciding factor is the scale of the project. For a small number of PDF files, the SEO advantages of converting clearly outweigh the labor involved. However, if a website makes use of a large number of PDFs (e.g., a scholarly journal that uses its website to distribute thousands of academic articles in PDF format), the decision would have to be more nuanced. The cost of labor to perform the conversion would need to be weighed against the possible increase in traffic/revenue, and at that scale such an analysis is likely to reveal that the cost outweighs the benefits.
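The cost-versus-revenue comparison described above can be sketched as a simple per-file check. This is only an illustration: the function names, file names, hours, hourly rate, and expected revenue gains are all hypothetical figures, and a real analysis would use the site's own traffic and labor data.

```python
def conversion_cost(hours_per_pdf, hourly_rate):
    """Labor cost of converting and optimizing one PDF as HTML."""
    return hours_per_pdf * hourly_rate


def select_for_conversion(expected_gain_by_pdf, hours_per_pdf, hourly_rate):
    """Return the PDFs whose expected revenue gain exceeds the conversion cost."""
    cost = conversion_cost(hours_per_pdf, hourly_rate)
    return sorted(
        name for name, gain in expected_gain_by_pdf.items() if gain > cost
    )


# Hypothetical expected revenue gain per file if it were converted to HTML.
gains = {
    "pricing-guide.pdf": 900.0,
    "annual-report-2012.pdf": 40.0,
    "product-specs.pdf": 310.0,
}

# Assumed: 3 hours of labor per file at 50 per hour, so 150 per conversion.
to_convert = select_for_conversion(gains, hours_per_pdf=3, hourly_rate=50)
print(to_convert)  # ['pricing-guide.pdf', 'product-specs.pdf']
```

Evaluating files individually, rather than the archive as a whole, lets a site with thousands of PDFs convert only the handful where the expected gain justifies the labor.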
Optimizing a PDF
If the cost involved prevents conversion, there are still a number of steps that can be taken to optimize PDFs:
- Assign META data: A title and a META description can both be assigned within the PDF file. Follow normal SEO best practices when writing them. Information on how to assign META data can be found on the Adobe website.
- Assign the Author: Though there is no conclusive research on whether this affects ranking, assigning the client as the author of the PDF can assist in branding efforts.
- Reduce the File Size: Larger PDF files can create longer load times, creating an opportunity for the visitor to abandon the content. Techniques for reducing file size can be found on the Adobe website.
- Assign Image Alt Text: Alternate (alt) text can be assigned to images in the PDF. Instructions can be found on the Adobe website.
- Avoid Duplicate Content: Apply standard SEO practices to avoid duplicating any content that may be on the client’s website.
- Write-Protect the Document: PDFs that are not write-protected can be altered and redistributed with changed content. Restrict editing only; as noted above, a PDF that requires a password to open cannot be crawled.
- Include Backlinks: Whenever possible, include a backlink to the client’s website so that both visitors and web crawlers have a path back to it.
To learn more about PDF-to-HTML conversion, contact Performics today.
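The checklist above can be turned into a simple pre-publish audit. The following is a minimal sketch that assumes each PDF's properties have already been read into a plain dictionary by whatever PDF tool is in use. The field names (title, description, author, size_bytes, images_without_alt, links_to_site) and both thresholds are hypothetical choices for illustration, not documented limits.

```python
MAX_TITLE_CHARS = 60               # assumed guideline, not a documented limit
MAX_SIZE_BYTES = 5 * 1024 * 1024   # assumed 5 MB size budget


def audit_pdf(props):
    """Return a list of checklist warnings for one PDF's properties."""
    warnings = []

    title = (props.get("title") or "").strip()
    if not title:
        warnings.append("missing title")
    elif len(title) > MAX_TITLE_CHARS:
        warnings.append("title longer than 60 characters")

    if not (props.get("description") or "").strip():
        warnings.append("missing description")
    if not (props.get("author") or "").strip():
        warnings.append("missing author")
    if props.get("size_bytes", 0) > MAX_SIZE_BYTES:
        warnings.append("file larger than size budget")
    if props.get("images_without_alt", 0) > 0:
        warnings.append("images without alt text")
    if not props.get("links_to_site", False):
        warnings.append("no link back to the website")
    return warnings


# Hypothetical properties for one file.
example = {
    "title": "PDF SEO Guide",
    "description": "",
    "author": "Performics",
    "size_bytes": 8 * 1024 * 1024,
    "images_without_alt": 0,
    "links_to_site": True,
}
print(audit_pdf(example))  # ['missing description', 'file larger than size budget']
```

Duplicate content and write protection are left out of the sketch because they cannot be judged from a single file's properties alone.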