* FAQ
:PROPERTIES:
:CREATED: [2019-07-04 Thu 14:12]
:CUSTOM_ID: 76ff29ca-c783-40d6-9edf-1eadc3b4d575
:END:

- What is the problem to solve?

  If you have exported emails from MS Outlook, you can only read them in MS Outlook. With many pst files, this becomes even more tedious.

- What can we do about it?

  No clear idea yet. There are tools for reading or converting pst files, such as readpst and Apache Tika. Using readpst, I will try to convert a batch of pst files into plain text together with their attachments. The emails will then be parsed and stored in a sqlite file. Subsequent search queries against that file should work well.

* How does it work?
:PROPERTIES:
:CREATED: [2019-07-04 Thu 14:13]
:CUSTOM_ID: 2d937bb9-199f-4237-8124-5fac6925fcde
:END:

1. The user puts the pst archives into /folder 1/.
2. The readpst tool extracts all data to /folder 2/.
3. /program 1/ reads all data in /folder 2/, parses it and converts it to an /sqlite file/ (why sqlite? Because of its simplicity and the flexibility of SQL, which should also provide better performance).
4. /program 2/, based on a search query, selects from the /sqlite file/ and outputs all matching emails and attachments to the /output folder/ (a separate folder for each query).

Let's try an example. We will use part of the Enron data set: it's big (~50 GB) and publicly available.

Assuming Go and readpst are installed:

#+BEGIN_SRC sh
mkdir /tmp/test-pst
cd /tmp/test-pst
curl -O https://s3.amazonaws.com/edrm.download.nuix.com/RevisedEDRMv1/albert_meyers.zip
unzip albert_meyers.zip
mv albert_meyers/albert_meyers_000_1_1.pst .
rm -fr albert_meyers
rm albert_meyers.zip

# extract the archive into separate files
mkdir extract
readpst -S -D -o extract albert_meyers_000_1_1.pst

# build the reader and search programs
git clone https://github.com/tonna/search-pst
cd search-pst
go get github.com/mattn/go-sqlite3
go build -o reader.so main1.go && go build -o search.so main2.go
cd -

# parse the extracted files into the sqlite file
./search-pst/reader.so -input=/tmp/test-pst/extract -output=db

# run a dummy query and list what was found
mkdir found
echo "select path, content from email where content like '%1%' limit 5;" > query-dummy.sql
./search-pst/search.so -input=/tmp/test-pst/db -output=/tmp/test-pst/found -query=query-dummy.sql
ls -R /tmp/test-pst/found
#+END_SRC

* Assumptions
:PROPERTIES:
:CREATED: [2019-07-06 Sat 22:58]
:CUSTOM_ID: 8541713b-b786-424d-a480-9173c33fb632
:END:

readpst exports files so that emails and their attachments are laid out as

#+BEGIN_SRC
folder1/folder2/email1
folder1/folder2/email2
folder1/folder2/email2-attachment1
folder1/folder2/email2-attachment2
folder1/folder3/email1
#+END_SRC

* Misc
:PROPERTIES:
:CREATED: [2019-07-07 Sun 00:11]
:CUSTOM_ID: 31eccff8-725e-4f20-92f3-fd5c85364a77
:END:

Building

#+BEGIN_SRC sh
git clone https://github.com/tonna/search-pst
cd search-pst
# dependency
go get github.com/mattn/go-sqlite3
go build -o reader.so main1.go && gofmt -s -w *.go
go build -o search.so main2.go && gofmt -s -w *.go
#+END_SRC

* todo-list
:PROPERTIES:
:CREATED: [2019-07-23 Tue 20:33]
:CUSTOM_ID: 3e21789d-81f4-4c78-b9b7-d95e5e5b751f
:END:

** NEXT Search accepts SQL queries
:PROPERTIES:
:CREATED: [2019-07-23 Tue 20:33]
:CUSTOM_ID: 34d60713-6d39-4243-a521-6aeb1f976e02
:END:

** TODO Match email and attachment files
:PROPERTIES:
:CREATED: [2019-07-23 Tue 20:34]
:CUSTOM_ID: 6a67e72c-4ebb-4e14-bb0b-d1fbdf3d3c38
:END:

Every email that has attachments should, when found by a search, be extracted together with its attachments. How and when should emails and attachments be matched? Probably when reading. Primary and foreign keys?
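
One possible way to match emails and attachments while reading is sketched below. This is not code from the repository: the function names =groupAttachments= and =parentEmail= are hypothetical, and the sketch relies only on the naming convention listed under Assumptions, namely that an attachment path is the path of its email followed by =-= and the attachment name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parentEmail returns the path of the email that p is attached to,
// or "" if p looks like an email itself.
// Assumption: an attachment is named "<email path>-<attachment name>".
func parentEmail(p string, known map[string]bool) string {
	// Only look for '-' inside the file name, not in the directories.
	start := strings.LastIndexByte(p, '/') + 1
	for i := start + 1; i < len(p); i++ {
		if p[i] == '-' && known[p[:i]] {
			return p[:i]
		}
	}
	return ""
}

// groupAttachments maps every email path to the sorted list of its
// attachment paths. Emails without attachments map to an empty list.
func groupAttachments(paths []string) map[string][]string {
	known := make(map[string]bool, len(paths))
	for _, p := range paths {
		known[p] = true
	}

	groups := make(map[string][]string)
	for _, p := range paths {
		parent := parentEmail(p, known)
		if parent == "" {
			// p is an email: make sure it has an entry.
			if _, ok := groups[p]; !ok {
				groups[p] = nil
			}
			continue
		}
		groups[parent] = append(groups[parent], p)
	}
	for _, attachments := range groups {
		sort.Strings(attachments)
	}
	return groups
}

func main() {
	// The layout from the Assumptions section.
	paths := []string{
		"folder1/folder2/email1",
		"folder1/folder2/email2",
		"folder1/folder2/email2-attachment1",
		"folder1/folder2/email2-attachment2",
		"folder1/folder3/email1",
	}
	groups := groupAttachments(paths)

	emails := make([]string, 0, len(groups))
	for e := range groups {
		emails = append(emails, e)
	}
	sort.Strings(emails)
	for _, e := range emails {
		fmt.Printf("%s -> %v\n", e, groups[e])
	}
}
```

If this grouping is done at read time, each email path could serve as a primary key and each attachment row could carry its email's path as a foreign key, so that a search hit on an email can pull in its attachments with a join. Whether that fits the actual schema is an open question.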

** TODO Come up with found file naming
:PROPERTIES:
:CREATED: [2019-08-06 Tue 21:52]
:CUSTOM_ID: c81be8dd-c3ba-445c-9c35-8faa4bef3ffe
:END: