{
  "id": "restructuring-pdf-ideology",
  "title": "Restructuring PDF Ideology",
  "author": {
    "name": "Dr. Todd J.B. Blayone",
    "cite_name": "Blayone, T. J. B.",
    "orcid": "0000-0001-6965-7033",
    "profile_url": "../profiles/profile-todd.html"
  },
  "date": "2025-10-19",
  "url": "https://scholarflow.ca/essays/restructuring-pdf-ideology.html",
  "summary": "The PDF made scholarship look stable, but it also trapped knowledge inside page geometry. Restructuring PDF Ideology asks what scholarly texts become when rebuilt as structured, computable artifacts for human-LLM reading.",
  "description": "How PDF publishing constrains human-LLM analysis, and why scholarly texts need structured, computable knowledge formats.",
  "tags": [
    "ScholarFlow",
    "PDF",
    "JSON",
    "human-LLM",
    "knowledge",
    "publishing",
    "structure"
  ],
  "source_type": "ScholarFlow essay",
  "license": "CC BY-NC 4.0",
  "word_count": 997,
  "reading_time_minutes": 5,
  "citations": {
    "apa": "Blayone, T. J. B. (2025, October 19). Restructuring PDF Ideology. ScholarFlow Research. https://scholarflow.ca/essays/restructuring-pdf-ideology.html",
    "bibtex": "@online{restructuringpdfideology2025,\n  title = {Restructuring PDF Ideology},\n  author = {Blayone, T. J. B.},\n  year = {2025},\n  month = {October},\n  url = {https://scholarflow.ca/essays/restructuring-pdf-ideology.html},\n  publisher = {ScholarFlow Research},\n  note = {ScholarFlow essay},\n  urldate = {2025-10-19}\n}",
    "ris": "TY  - ELEC\nTI  - Restructuring PDF Ideology\nAU  - Blayone, T. J. B.\nPY  - 2025\nDA  - 2025-10-19\nPB  - ScholarFlow Research\nUR  - https://scholarflow.ca/essays/restructuring-pdf-ideology.html\nER  -\n"
  },
  "llm_markdown": "---\ntitle: Restructuring PDF Ideology\nauthor: Dr. Todd J.B. Blayone\ndate: 2025-10-19\nsource: https://scholarflow.ca/essays/restructuring-pdf-ideology.html\nsource_type: ScholarFlow essay\nlicense: CC BY-NC 4.0\nreuse_terms: Non-commercial reuse permitted with proper academic source attribution.\ntags:\n  - ScholarFlow\n  - PDF\n  - JSON\n  - human-LLM\n  - knowledge\n  - publishing\n  - structure\n---\n\n# Suggested LLM Discussion Prompt\n\nPlease discuss this essay as a scholarly text. Preserve source attribution, distinguish the author's claims from your analysis, and use the essay as the primary context for interpretation.\n\n# Essay Text\n\nScholarly writing encodes structure in prose. Arguments are staged through sections, evidence is embedded in citations, and meaning depends on the choreography of textual and visual elements. Machines, however, encounter only surface geometry. They see fonts, coordinates, and bounding boxes. The challenge of effective human-LLM reading and interpretation is therefore one of reconstruction: recovering a shared ability to recognize logical and rhetorical form from typographic residue.\n\nThis predicament is historical, not technical. The *Portable Document Format* (PDF) became dominant mainly because most publishers, as inheritors of a print age, were locked into legacy production workflows and often lacked the expertise and incentive to evolve. PDF solved an early problem of the computer age: consistent rendering. PDF sought to ensure that every page appeared identical on both the screen and the printed page. This “guarantee” hardened into a belief system—the ideology of presentation grounded on the printed page. A properly formatted surface became synonymous with intellectual authority, and layout complexity and fidelity were closely aligned with scholarly integrity.\n\nThis ideology persists even as it obstructs progress in an age of potentially productive human-LLM activity. 
The academic publishing pipeline remains anchored to “pagination,” enforcing processes that strip explicit structure and meaning from born-digital manuscripts. Authors compose in richly structured digital environments, yet what circulates are visual facsimiles that humans must reread and machines must painstakingly decode. Every table, equation, and reference that could be computationally linked is instead flattened into pixels and coordinates.\n\nLarge language models expose this dysfunction. They can infer section hierarchies, recover citations, and map argumentative flows, but only after expending vast computational effort undoing the constraints imposed by the publishing system. The bottleneck is not intelligence but form. Humans are forcing machines to read as humans once had to—line by line, visually—when what is needed is a structured, interoperable representation of knowledge.\n\nAcademic preprint databases such as *arXiv* demonstrate the paradox clearly. Authors submit LaTeX sources, which are structurally interpretable, but the system still distributes a PDF as the canonical object. The structured text exists, yet it is entombed. Even recent moves toward EPUB merely transfer the logic of print to “e-readers”—devices designed for controlled consumption, not computation. These are not new media but digitized continuations of an old ideology.\n\n*ScholarFlow* begins from this impasse, but pursues liberatory knowledge pipelines. The project acknowledges that PDFs remain the default container for now, but treats them as inputs to be undone. Through automated parsing, interpretation, and reconstruction, PDFs become a starting point in the effective re-representation of structural knowledge. 
The goal is not to modernize the page but to liberate the content it imprisons—transforming scholarship from a static artifact into structured data that both humans and machines can read.\n\nOf course, the desire to make complex texts machine-readable long predates large language models. From the 1960s onward, information scientists, librarians, and computational humanists recognized that unstructured text was a dead end for computation. The shared goal was to represent documents in ways that machines could interpret, query, and transform without human mediation. What emerged was a lineage of markup systems—each technically ambitious and each revealing the tension between theoretical completeness and practical adoption.\n\nThe first significant attempt was SGML (Standard Generalized Markup Language), formalized in the 1980s. SGML was a triumph of abstraction: a meta-language for defining document types and relationships. It promised universal compatibility between publishers, databases, and research archives. Yet it was engineered for engineers. Authoring required manual tagging; validation depended on complex Document Type Definitions (DTDs), and the intellectual overhead of compliance most often outweighed the benefits for “average” users. Thus, while SGML succeeded as a proof of concept, it failed as a daily tool for working scholars.\n\nXML (Extensible Markup Language), introduced in the late 1990s, sought to domesticate SGML for the web. It simplified syntax and became the foundation for numerous specialized dialects (e.g., MathML, DocBook, TEI, JATS), each designed to encode domain-specific knowledge. These standards made genuine progress toward interoperability but carried the same genetic flaw: they treated text as an engineering problem. XML-based systems privileged precision and formal validation over ease of authoring and adaptation. 
Projects in digital humanities and scientific publishing became marathons of schema design and argument about semantics, not engines of accessible practice.\n\nThe pattern repeated across disciplines. Initiatives like TEI (Text Encoding Initiative) or JATS (Journal Article Tag Suite) provided formidable expressive power but demanded high entry costs and professionalized maintenance. They solved technical representation problems while ignoring the social and motivational forces that drive adoption. In short, they produced *perfect standards for systems that few used in everyday practice*. Structured representation became an academic pursuit rather than a working infrastructure.\n\nMeanwhile, outside academia, the web quietly standardized on a different philosophy: *good enough structure, everywhere.* JSON (JavaScript Object Notation) emerged in the early 2000s as a lightweight data-interchange format. It wasn’t perfect, but it was practical—simple enough to read, write, and debug without training. In contrast to XML’s rigidity, JSON’s flexibility allowed systems to evolve iteratively. *Developers embraced it because it served a use problem, not a purity problem.*\n\nThis divergence explains why structured representation in academia stagnated. The markup community optimized for internal coherence; the broader digital world optimized for usability and integration. When LLMs arrived, they readily interfaced with the latter ecosystem. JSON’s simplicity and ubiquity made it the lingua franca of AI pipelines. It was never designed for scholarly documents, but its design philosophy—clarity through constraint—turned out to be exactly what human-machine knowledge work needed.\n\n*ScholarFlow*’s philosophical and technical approach grows out of that lesson. It accepts that perfect representation is a mirage. 
The goal is not to model every nuance of scholarly expression but to create a resilient, interpretable layer that supports real use: parsing, retrieval, reasoning, analysis and synthesis. Where earlier systems sought to preserve the entire ontology of a text, *ScholarFlow*’s JSON format aims to capture just enough structure to make complex reading computationally possible, and to do so in a way that ordinary researchers can understand and extend.\n",
  "body_text": "Scholarly writing encodes structure in prose. Arguments are staged through sections, evidence is embedded in citations, and meaning depends on the choreography of textual and visual elements. Machines, however, encounter only surface geometry. They see fonts, coordinates, and bounding boxes. The challenge of effective human-LLM reading and interpretation is therefore one of reconstruction: recovering a shared ability to recognize logical and rhetorical form from typographic residue.\n\nThis predicament is historical, not technical. The *Portable Document Format* (PDF) became dominant mainly because most publishers, as inheritors of a print age, were locked into legacy production workflows and often lacked the expertise and incentive to evolve. PDF solved an early problem of the computer age: consistent rendering. PDF sought to ensure that every page appeared identical on both the screen and the printed page. This “guarantee” hardened into a belief system—the ideology of presentation grounded on the printed page. A properly formatted surface became synonymous with intellectual authority, and layout complexity and fidelity were closely aligned with scholarly integrity.\n\nThis ideology persists even as it obstructs progress in an age of potentially productive human-LLM activity. The academic publishing pipeline remains anchored to “pagination,” enforcing processes that strip explicit structure and meaning from born-digital manuscripts. Authors compose in richly structured digital environments, yet what circulates are visual facsimiles that humans must reread and machines must painstakingly decode. Every table, equation, and reference that could be computationally linked is instead flattened into pixels and coordinates.\n\nLarge language models expose this dysfunction. 
They can infer section hierarchies, recover citations, and map argumentative flows, but only after expending vast computational effort undoing the constraints imposed by the publishing system. The bottleneck is not intelligence but form. Humans are forcing machines to read as humans once had to—line by line, visually—when what is needed is a structured, interoperable representation of knowledge.\n\nAcademic preprint databases such as *arXiv* demonstrate the paradox clearly. Authors submit LaTeX sources, which are structurally interpretable, but the system still distributes a PDF as the canonical object. The structured text exists, yet it is entombed. Even recent moves toward EPUB merely transfer the logic of print to “e-readers”—devices designed for controlled consumption, not computation. These are not new media but digitized continuations of an old ideology.\n\n*ScholarFlow* begins from this impasse, but pursues liberatory knowledge pipelines. The project acknowledges that PDFs remain the default container for now, but treats them as inputs to be undone. Through automated parsing, interpretation, and reconstruction, PDFs become a starting point in the effective re-representation of structural knowledge. The goal is not to modernize the page but to liberate the content it imprisons—transforming scholarship from a static artifact into structured data that both humans and machines can read.\n\nOf course, the desire to make complex texts machine-readable long predates large language models. From the 1960s onward, information scientists, librarians, and computational humanists recognized that unstructured text was a dead end for computation. The shared goal was to represent documents in ways that machines could interpret, query, and transform without human mediation. 
What emerged was a lineage of markup systems—each technically ambitious and each revealing the tension between theoretical completeness and practical adoption.\n\nThe first significant attempt was SGML (Standard Generalized Markup Language), formalized in the 1980s. SGML was a triumph of abstraction: a meta-language for defining document types and relationships. It promised universal compatibility between publishers, databases, and research archives. Yet it was engineered for engineers. Authoring required manual tagging; validation depended on complex Document Type Definitions (DTDs), and the intellectual overhead of compliance most often outweighed the benefits for “average” users. Thus, while SGML succeeded as a proof of concept, it failed as a daily tool for working scholars.\n\nXML (Extensible Markup Language), introduced in the late 1990s, sought to domesticate SGML for the web. It simplified syntax and became the foundation for numerous specialized dialects (e.g., MathML, DocBook, TEI, JATS), each designed to encode domain-specific knowledge. These standards made genuine progress toward interoperability but carried the same genetic flaw: they treated text as an engineering problem. XML-based systems privileged precision and formal validation over ease of authoring and adaptation. Projects in digital humanities and scientific publishing became marathons of schema design and argument about semantics, not engines of accessible practice.\n\nThe pattern repeated across disciplines. Initiatives like TEI (Text Encoding Initiative) or JATS (Journal Article Tag Suite) provided formidable expressive power but demanded high entry costs and professionalized maintenance. They solved technical representation problems while ignoring the social and motivational forces that drive adoption. In short, they produced *perfect standards for systems that few used in everyday practice*. 
Structured representation became an academic pursuit rather than a working infrastructure.\n\nMeanwhile, outside academia, the web quietly standardized on a different philosophy: *good enough structure, everywhere.* JSON (JavaScript Object Notation) emerged in the early 2000s as a lightweight data-interchange format. It wasn’t perfect, but it was practical—simple enough to read, write, and debug without training. In contrast to XML’s rigidity, JSON’s flexibility allowed systems to evolve iteratively. *Developers embraced it because it served a use problem, not a purity problem.*\n\nThis divergence explains why structured representation in academia stagnated. The markup community optimized for internal coherence; the broader digital world optimized for usability and integration. When LLMs arrived, they readily interfaced with the latter ecosystem. JSON’s simplicity and ubiquity made it the lingua franca of AI pipelines. It was never designed for scholarly documents, but its design philosophy—clarity through constraint—turned out to be exactly what human-machine knowledge work needed.\n\n*ScholarFlow*’s philosophical and technical approach grows out of that lesson. It accepts that perfect representation is a mirage. The goal is not to model every nuance of scholarly expression but to create a resilient, interpretable layer that supports real use: parsing, retrieval, reasoning, analysis and synthesis. Where earlier systems sought to preserve the entire ontology of a text, *ScholarFlow*’s JSON format aims to capture just enough structure to make complex reading computationally possible, and to do so in a way that ordinary researchers can understand and extend."
}
