pdftk: The PDF Toolkit

Adding bookmarks and editing metadata

Extract the bookmarks and metadata from a PDF

   pdftk C:\Users\Sid\Desktop\doc.pdf dump_data output C:\Users\Sid\Desktop\doc_data.txt

Add or update the bookmarks of a PDF

   pdftk C:\Users\Sid\Desktop\doc.pdf update_info C:\Users\Sid\Desktop\doc_data.txt output C:\Users\Sid\Desktop\updated.pdf

Bookmark file format:

   BookmarkBegin
   BookmarkTitle: PDF Reference (Version 1.5)
   BookmarkLevel: 1
   BookmarkPageNumber: 1
   BookmarkBegin
   BookmarkTitle: Contents
   BookmarkLevel: 2
   BookmarkPageNumber: 3
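
Records in this format are easy to generate from a script. Below is a minimal sketch, assuming the table of contents is already available as `(title, level, page)` tuples; the helper name `render_bookmarks` and the tuple layout are made up for illustration and are not part of pdftk.

```python
def render_bookmarks(entries):
    """Render (title, level, page) tuples as pdftk bookmark records."""
    lines = []
    for title, level, page in entries:
        lines.append("BookmarkBegin")
        lines.append(f"BookmarkTitle: {title}")
        lines.append(f"BookmarkLevel: {level}")
        lines.append(f"BookmarkPageNumber: {page}")
    # One record field per line, each terminated by a newline
    return "".join(line + "\n" for line in lines)


# Hypothetical input mirroring the two entries shown above
toc = [("PDF Reference (Version 1.5)", 1, 1), ("Contents", 2, 3)]
print(render_bookmarks(toc), end="")
```

The returned string can be saved to a text file and passed to update_info.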

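Non-ASCII characters in the titles should be written as HTML entities (numeric character references of the form &#NNNN;). The Native/Unicode converters at https://tool.oschina.net/encode and https://www.online-toolz.com/tools/unicode-html-entities-convertor.php can do the conversion.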

Or in Python:

   u"阿斯顿".encode('ascii', 'xmlcharrefreplace')

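   def utf8_to_ascii_xml(filename):
       """Write an ASCII copy of a UTF-8 file, with non-ASCII characters as &#N; references."""
       with open(filename, 'r', encoding='utf-8') as fi:
           lines = fi.readlines()
       # Each line becomes ASCII bytes; non-ASCII characters turn into &#N; references
       lines_ascii = [x.encode('ascii', 'xmlcharrefreplace') for x in lines]
       # The result is written next to the input as <filename>.ascii
       with open(f"{filename}.ascii", 'wb') as fo:
           fo.writelines(lines_ascii)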
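
Going the other way is handy for reading titles in an encoded bookmark file. A minimal sketch using the standard library's `html.unescape`; the helper name `decode_entities` is hypothetical. Note that `html.unescape` also decodes named entities such as `&amp;`, which may or may not be what is wanted for a given file.

```python
import html


def decode_entities(text):
    """Turn numeric character references (&#N;) back into characters."""
    return html.unescape(text)


original = "阿斯顿"
# Same encoding step as above, decoded to str so it can be printed and compared
encoded = original.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(encoded)
print(decode_entities(encoded) == original)
```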

A handy regex substitution (turns "title,page" lines into level-2 bookmark records)

   s/\(.*\),\(.*\)/BookmarkBegin\nBookmarkTitle: \1\nBookmarkLevel: 2\nBookmarkPageNumber: \2/
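
The same substitution can be sketched with Python's `re` module. This assumes each input line has the form `title,page`; the function name `csv_to_bookmarks` is made up for illustration. As in the expression above, the first group is greedy, so a title that itself contains commas stays intact and only the text after the last comma is treated as the page number.

```python
import re

# Greedy first group: everything up to the LAST comma on the line is the title
PATTERN = re.compile(r"(.*),(.*)")

# \1 and \2 are backreferences to the title and the page number
TEMPLATE = (
    "BookmarkBegin\n"
    "BookmarkTitle: \\1\n"
    "BookmarkLevel: 2\n"
    "BookmarkPageNumber: \\2"
)


def csv_to_bookmarks(text):
    """Rewrite every 'title,page' line of text as a level-2 bookmark record."""
    return PATTERN.sub(TEMPLATE, text)


print(csv_to_bookmarks("Contents,3\nSyntax, Objects and Files,9"))
```

Lines without a comma are left unchanged, and whitespace around the page number is not stripped.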

Example workflow for adding bookmarks

   # import
   import pytesseract
   from PIL import Image
   import numpy as np
   import pandas as pd
   import re

   # Generate the table of contents with OCR
   print(pytesseract.image_to_string(Image.open('a.png')))

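   # Edit the table of contents by hand as an Org-mode table, then load it
   data = pd.read_csv("a.txt", sep="|")
   darr = np.array(data)
   # Strip the padding around the index and title columns
   darr[:,1] = [i.strip() for i in darr[:,1]]
   darr[:,2] = [i.strip() for i in darr[:,2]]
   # Add 9 to the page numbers from the third row on (e.g. to skip unnumbered front matter)
   darr[2:,3] = darr[2:,3]+9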
   data = pd.DataFrame(darr[:,1:-1])
   data

   # Render the bookmark format and save it
   block = """BookmarkBegin
   BookmarkTitle: {title}
   BookmarkLevel: {level}
   BookmarkPageNumber: {page}
   """
   txt = ""
   for line in darr:
       index = line[1]
       title = line[2]
       page = line[3]
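       # Entries with no index, or an index ending in '.', are top-level
       if index == '' or index[-1] == '.':
           level = 1
       else:
           level = 2
       # Append every entry, whatever its level
       txt += block.format(title=f"{index} {title}", level=level, page=page)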

   with open("b.txt",'w') as fi:
       fi.write(txt)

Merge PDF files

  pdftk in1.pdf in2.pdf cat output out1.pdf

DjVu to PDF

  1. djvulibre

     ddjvu -format=pdf -quality=85 -verbose a.djvu a.pdf

Split a PDF into single pages

  pdftk in.pdf burst output out_%04d.pdf
