Overview

A brief snapshot of PDF Clown's architecture.

NOTE: This is a stub article, still to be refined and completed.

Anatomy

To render the contents of PDF files as rigorously and cleanly as possible, PDF Clown features a layered structure that progressively hides syntactic details and reveals semantic entities (pages, bookmarks, fonts and so on), allowing clients to narrow their access scope to exactly what they need.

[Diagram: PDF Clown layered structure]

Here is a class diagram representing the library's main entities and their relations. Note that this is just a more detailed view of the same layers described above.

[Diagram: PDF Clown main classes]

The diagram below represents a Document object (the semantic root of a PDF file as modelled by PDF Clown) as it's viewed across the layers:

  1. Byte layer (raw PDF file): the dotted box inside the PDF file icon contains a sample data fragment that represents a Catalog Dictionary (root object);
  2. Token layer (lexical interpretation, i.e. parsing, of a PDF file): the bytes of the Catalog Dictionary are aggregated into atomic items (lexemes);
  3. Object layer (data structures emerging from token aggregation): an indirect object pattern is recognized, so that a PdfIndirectObject is instantiated to encapsulate the Catalog Dictionary data;
  4. File layer (higher syntactic representation of a PDF file in the PDF Clown model): the PdfIndirectObject containing the Catalog Dictionary is arrayed among the others to represent the PDF file structure;
  5. Document layer (semantic representation of a PDF file in the PDF Clown model): the Catalog Dictionary is encapsulated inside a Document object, which inherits from PdfObjectWrapper (bridge between the object layer and the document layer).
[Diagram: A Document object viewed across the layers]
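
The progression through the layers can be sketched with a toy model. The following Python fragment is only an illustration under simplifying assumptions: it is not PDF Clown code, the sample bytes are made up, and the names `tokenize`, `parse_indirect_object`, `Reference`, `IndirectObject` and `Document` are hypothetical stand-ins rather than the library's classes. It handles only names, integers and references inside a single flat dictionary.

```python
import re
from dataclasses import dataclass

# Byte layer: an illustrative raw fragment holding a Catalog Dictionary.
RAW = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

# Token layer: aggregate bytes into lexemes (delimiters, names, integers, keywords).
TOKEN_RE = re.compile(rb"<<|>>|/[^\s/<>\[\]()]+|[+-]?\d+|[A-Za-z]+")


def tokenize(data: bytes) -> list[str]:
    return [m.group().decode("ascii") for m in TOKEN_RE.finditer(data)]


# Object layer: data structures recognized from token patterns.
@dataclass(frozen=True)
class Reference:
    number: int
    generation: int


@dataclass
class IndirectObject:
    number: int
    generation: int
    data: dict


def parse_dictionary(tokens, pos):
    """Parse '<< key value ... >>' at tokens[pos]; return (dict, next position)."""
    if tokens[pos] != "<<":
        raise ValueError("expected '<<'")
    pos += 1
    result = {}
    while tokens[pos] != ">>":
        key = tokens[pos]
        pos += 1
        is_ref = (pos + 2 < len(tokens) and tokens[pos + 2] == "R"
                  and tokens[pos].isdigit() and tokens[pos + 1].isdigit())
        if is_ref:  # 'n g R' pattern: reference to another indirect object
            value = Reference(int(tokens[pos]), int(tokens[pos + 1]))
            pos += 3
        elif tokens[pos].startswith("/"):  # name
            value = tokens[pos]
            pos += 1
        else:  # integer
            value = int(tokens[pos])
            pos += 1
        result[key] = value
    return result, pos + 1


def parse_indirect_object(tokens):
    """Recognize the 'n g obj ... endobj' pattern around a dictionary."""
    if tokens[2] != "obj":
        raise ValueError("expected 'obj'")
    data, pos = parse_dictionary(tokens, 3)
    if tokens[pos] != "endobj":
        raise ValueError("expected 'endobj'")
    return IndirectObject(int(tokens[0]), int(tokens[1]), data)


# Document layer: a semantic wrapper hiding the dictionary syntax.
class Document:
    def __init__(self, catalog: IndirectObject):
        self._catalog = catalog

    @property
    def pages_reference(self) -> Reference:
        return self._catalog.data["/Pages"]


tokens = tokenize(RAW)
catalog = parse_indirect_object(tokens)
# File layer: indirect objects arrayed by (number, generation).
file_objects = {(catalog.number, catalog.generation): catalog}
document = Document(file_objects[(1, 0)])
```

In this sketch a client holding `document` asks for `pages_reference` without ever touching tokens or bytes, which is the kind of scope narrowing the layered structure is meant to allow.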

Functionalities

Content manipulation

Here I describe just one of the functionalities supported by PDF Clown, so that you can better evaluate the richness and flexibility its architecture delivers. To get hands-on with the practical use of this library, please see the Blog.

Graphics content manipulation can be performed at several levels, depending on the degree of abstraction you prefer:

  1. content stream (low level):
    1. unmanaged content stream: this is the lowest representation of the graphics contents, expressed as bytes. It can be accessed through PdfStream objects.
    2. managed content stream: this is the lowest syntactic representation of the graphics contents, expressed as tokens, i.e. literal chunks that build up operations and graphics objects (text, images, paths...). It can be accessed through Parser objects.
  2. content model (mid level):
    1. unmanaged content model: the tokens are aggregated into graphics operations and graphics objects, unaware of their current graphics state. It can be accessed through Contents objects.
    2. managed content model: operations and graphics objects (text, images, paths...) are contextualized within their current graphics state. It can be accessed through ContentScanner objects.
  3. composition (high level):
    1. primitive composition: operations and graphics objects are abstracted into graphics elements. It can be accessed through PrimitiveFilter objects.
    2. static composition: graphics elements are constrained within page areas for alignment purposes. It can be accessed through BlockFilter objects.
    3. dynamic composition: graphics elements can be spread across pages for flowing composition purposes. It can be accessed through FlowFilter objects (this level hasn't been implemented yet).

The content stream and content model levels can be used when reading and/or writing PDF files, while the composition level is write-only.
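
To make the difference between the low and mid levels more concrete, here is a minimal Python sketch. It does not use PDF Clown's Parser, Contents or ContentScanner classes: `tokenize`, `build_operations`, `scan` and `Operation` are hypothetical names, the content stream is a made-up sample, and tokenization is reduced to whitespace splitting (which only works because the sample contains no strings, arrays or inline images). The only graphics state tracked is the line width, using the standard PDF operators `w` (set line width), `q`/`Q` (save/restore graphics state) and `S` (stroke).

```python
import re
from dataclasses import dataclass

# Unmanaged content stream: the graphics contents as raw bytes.
CONTENT = b"2 w q 5 w 0 0 m 10 10 l S Q 0 0 m 20 0 l S"

NUMBER_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def tokenize(data: bytes) -> list[str]:
    """Managed content stream: bytes split into tokens (simplified)."""
    return [t.decode("ascii") for t in data.split()]


@dataclass
class Operation:
    operator: str
    operands: list


def build_operations(tokens):
    """Unmanaged content model: operands followed by their operator,
    grouped into operations with no knowledge of the graphics state."""
    operations, operands = [], []
    for token in tokens:
        if NUMBER_RE.fullmatch(token):
            operands.append(float(token))
        else:
            operations.append(Operation(token, operands))
            operands = []
    return operations


def scan(operations):
    """Managed content model: yield each operation together with the
    line width in effect, honouring the q/Q save/restore stack."""
    state = {"line_width": 1.0}  # PDF default line width
    stack = []
    for op in operations:
        if op.operator == "q":
            stack.append(dict(state))
        elif op.operator == "Q":
            state = stack.pop()
        elif op.operator == "w":
            state["line_width"] = op.operands[0]
        yield op, state["line_width"]


operations = build_operations(tokenize(CONTENT))
stroke_widths = [width for op, width in scan(operations) if op.operator == "S"]
print(stroke_widths)  # [5.0, 2.0]
```

The list returned by `build_operations` cannot tell which width the two `S` operations are stroked with; only the scanning step, which replays the state changes in order, can answer that. This is the distinction between operations that are unaware of their graphics state and operations contextualized within it.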