python - Not receiving correct pattern from regex on PyPDF2 for a PDF
I want to extract all instances of a particular word from a PDF, e.g. 'math'. So far I am converting the PDF to text using PyPDF2 and then running a regex on it to find what I want. Example PDF
When I run the code, instead of returning the matches for the regular expression pattern 'math', it returns the string of the whole page. Please help, thanks.
# First, change the current working directory to the desktop
import os
os.chdir('/users/hussein/desktop')  # the file is located on the desktop

# Second, PyPDF2
import PyPDF2
pdffileobj = open('test1.pdf', 'rb')  # opening the file
pdfreader = PyPDF2.PdfFileReader(pdffileobj)
pageobj = pdfreader.getPage(3)  # for the test I only need page 3
textversion = pageobj.extractText()
print(textversion)

# Third, the regular expression
import re
match = re.findall(r'math', textversion)
for match in textversion:
    print(match)
Instead of getting the instances of 'math', I receive this:
i n t r o d u c t o n
etc., etc.
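The textversion variable holds the text. When you use a for loop on it, it gives you the text one character at a time, as you have seen. The findall function returns a list of matches, so if you use that in the for loop instead, you get each matched word (which in your test will all be the same).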
import re
for match in re.findall(r'math', textversion):
    print(match)
The result returned by findall will look like:
["math", "math", "math"]
So your output will be:
math
math
math
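To see the difference between the two loops without needing a PDF, here is a minimal sketch. The sample_text string is a made-up stand-in for whatever extractText() returns, so the exact matches on your page will differ.

```python
import re

# Hypothetical stand-in for the text returned by extractText()
sample_text = "math is fun. I like math, and math likes me."

# Looping over the string itself yields one character at a time
chars = [ch for ch in sample_text]

# Looping over the findall result yields one match at a time
matches = re.findall(r'math', sample_text)

print(chars[:4])   # the first few single characters
print(matches)     # the list of matched words
```

If you also want to catch 'Math' or avoid matching inside longer words, re.findall accepts flags such as re.IGNORECASE, and the pattern can use \b word boundaries, but that depends on what you count as an instance.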