DocketScope

Attacking Problem Attachments in DocketScope

Every public comment analysis project must deal with its share of digital attachments to comment submissions (e.g., PDFs, Word files, or images). These attachments come in many file formats, can be very long, and can have complicated formatting issues. DocketScope® software has features that automatically solve most problems with attachments, but a savvy user has a few additional options for addressing problem attachments.

When DocketScope receives a comment submission, either through its Regulations.gov interface, its FDMS extract loader, or another data source, it automatically processes the submission’s attachments. While most, though not all, of these attachments are PDF files, they can be generated in several ways, including from other digital files, photographs, and scans of existing documents. DocketScope’s technology intelligently converts the content into HTML, preserving formatting and document layout, allowing a user to easily highlight specific citations from the document and map these citations to issues. This processing also enables DocketScope’s clustering of duplicate and similar comment submissions discussed in an earlier blog post. In most cases, this process works well, but every now and then, an attachment will not be converted properly, its layout may impede the conversion to HTML, or it may simply be too long to process. In these edge cases, here are some options for the analyst.

Issue #1: Unreadable PDF Attachments.

DocketScope generally makes most attachments easy to read and map, but there are edge cases where the results can be improved. For example, an attachment could be created through multiple steps, such as a commenter typing a comment on a sheet of paper and then taking a photo of that sheet. Even the most up-to-date OCR (Optical Character Recognition) technology can sometimes have trouble reading the rare attachments generated in an idiosyncratic way. Until DocketScope recognizes the text, the reviewer cannot easily cite it. Here are a few ideas that may help.

  • Solution #1: Try Other Conversions. 
    DocketScope allows users to download and upload comment attachments. A user can try a different OCR process by clicking the Attachments tab, downloading the comment’s attachment, using an external tool to process the attachment, then uploading a text-based PDF version of the document back to DocketScope using the Attachments tab. Microsoft Word and Adobe Acrobat Pro both have OCR capabilities that are invoked when they are used to open image-based PDF files. There are also several online options for converting more complex documents.
  • Solution #2: Divide Long Documents.
    Conversion processes often do better with shorter documents. Dividing a longer document into sections, either based on the document logic or a specific number of pages, may improve conversion results.
  • Solution #3: Use the Scratchpad.
    If the automated conversions fail, a user can click at the top right corner of the Comment & Review tab to manually add information. Generally, the automated conversions work, but this approach provides a backstop that always works!
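
For Solution #2, it can help to plan the page ranges before splitting a document in an external PDF tool. The sketch below is only an illustration and is not part of DocketScope: the function name page_ranges and the 40-page chunk size are assumptions chosen for the example, and the actual splitting would still be done in whatever PDF tool the user prefers.

```python
def page_ranges(total_pages, pages_per_part):
    """Return 1-based, inclusive (start, end) page ranges that cover a document.

    Hypothetical helper for planning a split; it does not touch any file.
    """
    if total_pages < 0:
        raise ValueError("total_pages cannot be negative")
    if pages_per_part < 1:
        raise ValueError("pages_per_part must be at least 1")
    return [
        # The last part may be shorter than pages_per_part.
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]


if __name__ == "__main__":
    # Plan a split of a 95-page attachment into parts of at most 40 pages.
    for part_number, (first, last) in enumerate(page_ranges(95, 40), start=1):
        print(f"Part {part_number}: pages {first}-{last}")
```

Splitting by a fixed page count is the simplest plan. When the document has clear sections, dividing along those section boundaries instead may make the resulting parts easier to map.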

Issue #2: Heavily Mapped Attachments.

Frequently, the most significant and technical rulemakings prompt lengthy and in-depth comments. DocketScope typically handles large documents well, but an especially long and substantive comment attachment can begin to respond more slowly as the user creates more and more citations referencing that single attachment.

  • Solution #1: Duplicate the Attachment.
    When a user notices that processing has slowed because of many mappings within a given attachment, they can go to the Attachments tab and upload a second copy of the same attachment. They can then return to the Text & Review page and begin mapping the new copy of the attachment. The reviewer will once again be able to map new citations quickly on the new copy.
  • Solution #2: Divide Long Attachments.
    When a user knows a document will have many citations, it may help to download the document, divide it into smaller documents, then upload and map each of the smaller documents. This works similarly to the duplication solution but allows easier planning and helps avoid duplicated work.

Issue #3: Incorrect Formatting, Headers, and Footers.

A well-researched attachment may provide footnotes to support the text of its comment. A commenting organization also often includes its name, contact information, and a page number at the top or bottom of an attachment page. Additionally, sometimes the formatting of an attachment can make the attachment slightly difficult to read. While these issues do not cause major problems, the reviewer will likely wish to minimize them to present the information in the most easily readable and visually appealing report possible.

  • Solution #1: Modify the Citation.
    A user can double-right click a mapped citation in the document. The user can then modify the text as it will appear in DocketScope reports. Using this function, the user can move a footnote to the bottom of a citation, delete unneeded information such as page numbers and company information in headers or footers, and modify the formatting of the citation, improving the readability of the eventual report.
  • Solution #2: Download and Clean the Attachment.
    A user could also consider downloading an attachment from the Attachments tab and using Word or Acrobat to clean the attachment before mapping. If the user opts for this solution, they should be careful to make sure no important information is lost in the process.
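
When cleaning a downloaded attachment outside DocketScope, one way to find headers and footers is to look for lines that repeat at the top or bottom of most pages. The sketch below illustrates that idea under stated assumptions: the page text has already been extracted by an external tool into a list of strings, one per page, and the function names and the 60% threshold are hypothetical choices for the example. It does not describe how DocketScope itself works.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse digit runs so "Page 3" and "Page 4" compare as the same line."""
    return re.sub(r"\d+", "#", line.strip())


def find_repeated_edge_lines(pages, min_share=0.6):
    """Return normalized first/last lines found on at least min_share of pages."""
    counts = Counter()
    for text in pages:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if lines:
            # A set avoids double counting when a page has only one line.
            counts.update({normalize(lines[0]), normalize(lines[-1])})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_edge_lines(pages, repeated):
    """Drop a page's first and last non-blank line when it is a repeated edge line.

    Blank lines are also dropped, which keeps the example short.
    """
    cleaned = []
    for text in pages:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if lines and normalize(lines[0]) in repeated:
            lines = lines[1:]
        if lines and normalize(lines[-1]) in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


if __name__ == "__main__":
    # Made-up sample pages, for illustration only.
    sample = [
        "Acme Coalition\nWe support the proposed rule.\nPage 1",
        "Acme Coalition\nHowever, section 2 is unclear.\nPage 2",
        "Acme Coalition\nThank you for considering our views.\nPage 3",
    ]
    repeated = find_repeated_edge_lines(sample)
    for page in strip_edge_lines(sample, repeated):
        print(page)
```

Because a heuristic like this can also remove a legitimate line that happens to repeat, the cleaned text should be compared against the original before it is uploaded, in keeping with the caution above about losing important information.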

Reviewing and citing the information presented in attachments is a major component of analyzing comments on a proposed rule or other government action. DocketScope software automatically does much of this work, but resourceful reviewers can improve the process further by familiarizing themselves with the Attachments tab and other tools used to map attachments.