no_repeat_ngram_size= 35
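Most people assume an OCR system needs no special handling of n-gram repetition, since that mainly matters in text generation. The authors explicitly set no_repeat_ngram_size to 35, which indicates that their OCR system has to guard against repeated patterns in long text. This challenges the mainstream view that OCR simply extracts text and need not deal with the characteristics of text generation.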
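To make the note above concrete, here is a minimal sketch of how a no-repeat n-gram constraint can work during decoding. It assumes the parameter behaves like the same-named option in common text-generation libraries; the highlight itself only shows the value 35. The function name `banned_next_tokens` is hypothetical and is not taken from the annotated project.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already seen in `tokens`.

    Illustrative only: a decoder applying this constraint would set the
    score of every returned token to -inf before choosing the next token.
    """
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    prefix_len = ngram_size - 1
    # The last (n - 1) tokens generated so far; empty tuple when n == 1.
    current_prefix = tuple(tokens[len(tokens) - prefix_len:])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        # If an earlier n-gram starts with the current prefix, its final
        # token would create a repeat, so it is banned.
        if ngram[:prefix_len] == current_prefix:
            banned.add(ngram[-1])
    return banned


# With a small n the constraint triggers quickly.
print(banned_next_tokens([1, 2, 3, 1, 2], 3))
# With n = 35 nothing is banned until at least 35 tokens exist.
print(banned_next_tokens([1, 2, 3, 1, 2], 35))
```

A large window such as 35 leaves short legitimate repeats (table cells, repeated headings) alone and only blocks long verbatim loops.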
max_length= 32768
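Most people assume the text length an OCR model can handle is limited by the model architecture, typically to a few thousand words. The authors set max_length as high as 32768, far beyond what traditional OCR systems handle. This suggests the model can process very long documents without losing context, and it challenges the assumed length limits of OCR systems.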
Single image supports two configs: gundam or base
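Most people assume an OCR model has to be configured specifically for a given task or document type, but the authors state that a single image supports two quite different configs ('gundam' or 'base'). This challenges the industry consensus that OCR systems usually need scenario-specific configuration.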
Welcome the Era of One-shot Long-horizon Parsing.
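Most people assume OCR requires multiple passes or fine-tuning for different document types, but the authors claim that Unlimited-OCR achieves 'one-shot long-horizon parsing'. This challenges the conventional view that OCR needs multiple passes, and it suggests that one model can handle a variety of complex documents without specialized training.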
OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched.
Converts an image or scanned PDF to an OCRed PDF. Command line on Windows, when installed via winget: py -m ocrmypdf --sidecar R.txt --output-type pdf R.pdf R_01.pdf
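The command in the note above can also be driven from a script. The sketch below only uses the flags that appear in the note (--sidecar, --output-type pdf); the helper names `ocrmypdf_command` and `run_ocrmypdf` are hypothetical, and using `sys.executable` instead of the Windows `py` launcher is an assumption made for portability.

```python
import subprocess
import sys


def ocrmypdf_command(input_pdf, output_pdf, sidecar_txt):
    """Build the argument list for the OCRmyPDF call quoted in the note."""
    return [
        sys.executable, "-m", "ocrmypdf",  # assumption: replaces "py -m ocrmypdf"
        "--sidecar", sidecar_txt,          # plain-text copy of the recognized text
        "--output-type", "pdf",
        input_pdf,
        output_pdf,
    ]


def run_ocrmypdf(input_pdf, output_pdf, sidecar_txt):
    """Run OCRmyPDF; raises CalledProcessError if the tool reports failure."""
    subprocess.run(ocrmypdf_command(input_pdf, output_pdf, sidecar_txt), check=True)
```

Calling `run_ocrmypdf("R.pdf", "R_01.pdf", "R.txt")` would reproduce the command from the note, provided OCRmyPDF is installed for the interpreter that runs the script.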
mypdfocr: spacing problem in Chinese text recognition
Make PDF file searchable OCRmyPDF
Reversible Object-Oriented Interpreters
how about some self-help?!
this passive "we are consumers" crap is exactly the problem...
I bought the print book for 22 euros, cut off the spine with a circular saw, and ran the 208 pages through my ADF scanner (Brother ADS-3000N, 150 EUR used). Without preparation that is maybe half an hour of work. Then rotate, crop, and level the scans, and run them through tesseract. tesseract needs a fast CPU.
Right now I am proofreading the hOCR files from tesseract. Later I will turn them into a PDF and upload it to annas-archive.org via libgen.rs - one problem less.
The hOCR files are uploaded at https://github.com/milahu/enteignung - maybe someone wants to help with proofreading, then it will go 1 or 2 days faster.
man oh man... as an "IT insider" I am so bored by the normies who got stuck 20 years ago when it comes to IT and have no clue about linux, git, python, torproject, monero, ... but the main thing is talking shit in telegram >:(
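The Tesseract step of the workflow described above can be sketched as follows. This is an illustration, not the commenter's actual script: the helper names, the `scans` directory, the PNG file pattern, and the German language code `deu` are all assumptions. The trailing `hocr` argument is Tesseract's standard config name for hOCR output.

```python
import subprocess
from pathlib import Path


def tesseract_hocr_command(image, lang="deu"):
    """Build a Tesseract call that writes <image stem>.hocr next to the scan."""
    image = Path(image)
    out_base = image.with_suffix("")  # Tesseract appends ".hocr" itself
    return ["tesseract", str(image), str(out_base), "-l", lang, "hocr"]


def ocr_scans(scan_dir="scans", pattern="*.png", lang="deu"):
    """Run Tesseract over every scanned page, in filename order (hypothetical layout)."""
    for image in sorted(Path(scan_dir).glob(pattern)):
        subprocess.run(tesseract_hocr_command(image, lang), check=True)
```

Each page is processed independently, so the loop could be parallelized across CPU cores, which is where the comment's point about needing a fast CPU comes in.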
ChatGPT Vision: The Best Way to Transform Your Paper Notes Into Digital Text
Upload a photo into ChatGPT and ask it to transcribe the photo into text. Better than OCR? It creates meaning out of the surrounding context, even though individual words may be wrong.
Can be used to run optical character recognition on .pdf documents and return documents with selectable, machine-readable text.
Worried about paper cards being lost or destroyed
I am loving using paper index cards. I am, however, worried that something could happen to the cards and I could lose years of work. I did not have this worry when my notes were all online. Are there any apps that you are using to make a digital copy of the notes? Ideally, I would love to have a digital mirror, but I am not willing to do 2x the work.
u/LBHO https://www.reddit.com/r/antinet/comments/y77414/worried_about_paper_cards_being_lost_or_destroyed/
As a firm believer in the programming principle of DRY (Don't Repeat Yourself), I can appreciate the desire not to do the work twice.
Note card loss and destruction is definitely a thing folks have worried about. The easiest thing may be to spend a minute or two every day and make quick photo backups of your cards as you make them. Then if things are lost, you'll have a backup from which you can likely use OCR (optical character recognition) software to pull your notes and recreate them if necessary. I've outlined some details I've used in the past. Incidentally, opening a photo in Google Docs will automatically do a pretty reasonable OCR on it.
I know some have written about bringing old notes into their (new) zettelkasten practice, and the general advice here has been to only pull in new things as needed or as heavily interested to ease the cognitive load of thinking you need to do everything at once. If you did lose everything and had to restore from back up, I suspect this would probably be the best advice for proceeding as well.
Historically many have worried about loss, but the only actual example of loss I've run across is that of Hans Blumenberg whose zettelkasten from the early 1940s was lost during the war, but he continued apace in another dating from 1947 accumulating over 30,000 cards at the rate of about 1.5 per day over 50 some odd years.
Digitizing and compressing notes - Question
reply to: https://www.reddit.com/r/antinet/comments/wv9hvq/digitizing_and_compressing_notes_question/
I've got a process I still use, though less frequently, that does both photos as well as optical character recognition (OCR) to digitize the words: https://boffosocko.com/2021/12/20/handwriting-my-website-with-a-digital-amanuensis/ The comments have some rich commentary with related ideas as well.
I've used ABBYY FineReader (best on Windows) and it was much better at correcting OCR than Adobe Acrobat. - Dana Conard
COCO-Text: Dataset for Text Detection and Recognition
See also the COCO-Text V2 site.
Free All-in-one PDF tools A reliable, intuitive and productive PDF Software
A paid Apple based tool for text recognition and extraction
T. LUCRETI CARI
Not going to be the prettiest version, but at least somewhat OCR'd for annotating!
Titi Lucreti Cari De Rerum Natura Libri Sex, With a Translation and Notes, Volume 1. Edited by H. A. J. Munro. Lucretius
Testing out the OCR functionality of docdrop.org.
I'm noticing that the pdf fingerprint of this text somehow matches that of other texts as there are a lot of non-related annotations on this page.
Is docdrop doing something squirrelly with the fingerprint @dwhly?
Apart from a basic segmenter taken from OCRopus a trainable line extractor is in the process of being implemented. Full trainability of layout analysis is of utmost importance to a truly universal OCR system, as text layout and its semantics varies widely across time and space, e.g. hand-crafted methods for printed Latin text are unlikely to work reliably on Arabic text or manuscripts with extensive interlinear annotation.
wip implementation of line segmentation in kraken
nice recipe for quickly turning a scanned PDF into a searchable one
MyScript MathPad
This looks like something I could integrate into my workflow.
Adobe Acrobat Pro.
gImageReader is an excellent open source alternative. It runs on both Windows and Linux, and it provides a simple (yet powerful) frontend GUI to Google's robust open source OCR engine, Tesseract.
I think an open source tool such as this is a better fit for the open annotation ecosystem that Hypothesis promotes, which is based on libre software and standards, than a proprietary (and expensive) tool such as Adobe Acrobat Pro.
tessdoc Tesseract documentation