I recently started a major project: running OCR on the documents for SFI AJS37 (parts 1, 2 and 3). It is a lot of work and takes a long time, but it will be very rewarding in the end. Truth be told, I actually enjoy the process.
The main goal of this project is to make SFI AJS37 available for download in modern digital formats with wide compatibility. The source documents are PDF files containing photocopies of the paper originals. Producing those photocopies must have been a major undertaking in itself, and it gave us Viggen enthusiasts something truly unique to read and study in depth. The disadvantage of these PDF documents is that the text cannot be copied or edited, as the pages are made up of images, not text.
Therefore, I started my journey to convert the documents into various digital formats. The converted documents can be copied, edited, saved in other formats and adapted to your needs. Sharing with other enthusiasts is also encouraged.
You can follow the creation progress here.
To aid availability, the document formats are made to be compatible with commonly used word processors such as Microsoft Word and LibreOffice. The formats should also be operating system agnostic as far as possible, i.e. it should be possible to open them on Windows, Mac and Linux. I encourage the concept of Open Source, so the documents should be highly compatible with open-source word processor suites as well.
Document formats and compatibility
To achieve the best overall compatibility, the documents are standardised. This means that they start from a “baseline” using the lowest common denominator. For example, common fonts such as Arial, Times New Roman and Verdana are used as much as possible. This is the best way to ensure that documents look as close to identical as possible across different applications, while still staying true to the original look.
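As an illustration of the "baseline" idea, here is a minimal Python sketch that lists the fonts a .docx file references and flags any that fall outside the baseline set. It is not part of my actual workflow. It assumes that font names appear as `w:ascii` attributes on `w:rFonts` elements in `word/document.xml` and `word/styles.xml`, which is the usual case but not the only place fonts can be declared (theme fonts, for example, are not covered). The function names are made up for this example.

```python
import re
import zipfile

# Baseline fonts named in the text above; extend the set as needed.
BASELINE_FONTS = {"Arial", "Times New Roman", "Verdana"}

# Parts of a .docx package that commonly carry font references (assumption:
# headers, footers and theme parts are ignored in this sketch).
FONT_PARTS = ("word/document.xml", "word/styles.xml")

# Captures the value of the w:ascii attribute on <w:rFonts> elements.
RFONTS_ASCII = re.compile(r'<w:rFonts\b[^>]*?\bw:ascii="([^"]+)"')


def fonts_in_docx(path):
    """Return the set of font names referenced via w:rFonts/w:ascii."""
    found = set()
    with zipfile.ZipFile(path) as package:
        available = set(package.namelist())
        for part in FONT_PARTS:
            if part in available:
                xml = package.read(part).decode("utf-8")
                found.update(RFONTS_ASCII.findall(xml))
    return found


def non_baseline_fonts(path, baseline=BASELINE_FONTS):
    """Return the fonts used in the document that are not in the baseline."""
    return fonts_in_docx(path) - set(baseline)
```

Calling `non_baseline_fonts("some-document.docx")` returns an empty set when only baseline fonts are referenced in the scanned parts; the file name here is a placeholder.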
The base document format used is .docx (Microsoft Word) which is then exported to other formats and tested on the various platforms.
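For readers who want to batch-export a .docx file themselves, the following Python sketch shows one possible approach using LibreOffice in headless mode. This is not necessarily how I produce the exports. It assumes LibreOffice is installed and that its `soffice` executable is on the PATH, and the target formats (`odt` and `pdf`) are example choices only. The function names are made up for this example.

```python
import subprocess
from pathlib import Path

# Example target formats (assumption: pick whatever formats you need).
TARGET_FORMATS = ("odt", "pdf")


def build_convert_command(source, target_format, out_dir):
    """Build the LibreOffice headless command that converts one file."""
    return [
        "soffice",            # assumes the LibreOffice binary is on the PATH
        "--headless",         # run without opening the user interface
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(source),
    ]


def export_all(source, out_dir, formats=TARGET_FORMATS):
    """Convert the source document to every format in `formats`."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for fmt in formats:
        # check=True raises CalledProcessError if LibreOffice reports failure.
        subprocess.run(build_convert_command(source, fmt, out_dir), check=True)
```

An automated export still needs the same manual check on each platform afterwards, since layout can shift between applications.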
The documents are available in, and compatible with, these word processors and formats:
As a bonus, some of the documents are available in Google Docs format as well.
The documents are created to look as similar as possible to the originals in terms of design, layout, fonts, etc. As mentioned, this is the main priority. Because of this, there is no clickable table of contents and no automated page numbering. However, the .docx versions do have clickable, collapsible headings in Microsoft Word.
OCR text conversion
To be able to transform the static PDF images into text in a reasonable manner, OCR scanning is performed using the free and amazing application ShareX. See this article for more details on how this is done.
- Windows 10 Pro, version 20H2
- Microsoft Word for Microsoft 365, 64-bit, Windows
- LibreOffice, Windows/Mac/Linux
- ShareX, Windows