search-pst

command module
v0.0.0-...-50afaee Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 22, 2019 License: MIT Imports: 9 Imported by: 0

README

* FAQ
  :PROPERTIES:
  :CREATED:  [2019-07-04 Thu 14:12]
  :CUSTOM_ID: 76ff29ca-c783-40d6-9edf-1eadc3b4d575
  :END:
- What is the problem to solve?

  If you have exported emails from MS Outlook, you're also limited to MS Outlook in order to read them. And if there are a lot of pst files, it becomes even more tedios.

- What can we do about it?

  No clear idea yet. There are tools for reading or converting pst file such as readpst, apache tika.

  Using readpst, I will try to convert a batch of pst files into plain text together with their attachments. Then emails will be parsed and converted to a sqlite file. Any search queries that will subsequently be performed should work well.

* How it works?
  :PROPERTIES:
  :CREATED:  [2019-07-04 Thu 14:13]
  :CUSTOM_ID: 2d937bb9-199f-4237-8124-5fac6925fcde
  :END:

1. User puts the pst archives into /folder 1/.
2. Using readpst tool, extracts all data to /folder 2/.
3. /program 1/ reads all data in /folder 2/, parses it and converts it to an /sqlite file/ (why sqlite? Because of its simplicity, flexibility of sql - which should provide better performance.)
4. /program 2/ , based on a search query, makes selections from the /sqlite file/ and outputs all matching emails and attachments to /output folder/ (separate folder for each query).

Lets try an example. We will use part of Enron data set - it's big (~50 GB) and publically available.

Assuming golang and readpst are installed:

#+BEGIN_SRC
mkdir /tmp/test-pst
cd /tmp/test-pst

curl -O https://s3.amazonaws.com/edrm.download.nuix.com/RevisedEDRMv1/albert_meyers.zip
unzip albert_meyers.zip
mv albert_meyers/albert_meyers_000_1_1.pst .
rm -fr albert_meyers
rm albert_meyers.zip

#extracting archive to files
mkdir extract
readpst -S -D -o extract albert_meyers_000_1_1.pst

git clone https://github.com/tonna/search-pst
cd search-pst

go get github.com/mattn/go-sqlite3
go build -o reader.so main1.go && go build -o search.so main2.go
cd -

./search-pst/reader.so -input=/tmp/test-pst/extract -output=db

mkdir found
echo "select path, content from email where content like '%1%' limit 5;" > query-dummy.sql
./search-pst/search.so -input=/tmp/test-pst/db -output=/tmp/test-pst/found -query=query-dummy.sql
ls /tmp/test-pst/found -R
#+END_SRC

* Assumptions
  :PROPERTIES:
  :CREATED:  [2019-07-06 Sat 22:58]
  :CUSTOM_ID: 8541713b-b786-424d-a480-9173c33fb632
  :END:
readpst exports files in a way that email and attachments are presented as

#+BEGIN_SRC
folder1/folder2/email1
folder1/folder2/email2
folder1/folder2/email2-attachment1
folder1/folder2/email2-attachment2
folder1/folder3/email1
#+END_SRC

* Misc
  :PROPERTIES:
  :CREATED:  [2019-07-07 Sun 00:11]
  :CUSTOM_ID: 31eccff8-725e-4f20-92f3-fd5c85364a77
  :END:
Building

#+BEGIN_SRC
git clone https://github.com/tonna/search-pst

cd search-pst

#dependency
go get github.com/mattn/go-sqlite3

go build -o reader.so main1.go && gofmt -s -w *.go
go build -o search.so main2.go && gofmt -s -w *.go
#+END_SRC

* todo-list
  :PROPERTIES:
  :CREATED:  [2019-07-23 Tue 20:33]
  :CUSTOM_ID: 3e21789d-81f4-4c78-b9b7-d95e5e5b751f
  :END:

** NEXT Search accepts SQL queries
   :PROPERTIES:
   :CREATED:  [2019-07-23 Tue 20:33]
   :CUSTOM_ID: 34d60713-6d39-4243-a521-6aeb1f976e02
   :END:
** TODO Match email and attachment files
   :PROPERTIES:
   :CREATED:  [2019-07-23 Tue 20:34]
   :CUSTOM_ID: 6a67e72c-4ebb-4e14-bb0b-d1fbdf3d3c38
   :END:

Every email file that has attachments, if found in search, should be extracted with attachments.

How and when match emails and attachments? When reading I think. Primary and foreign keys?

** TODO Come up with found file naming
   :PROPERTIES:
   :CREATED:  [2019-08-06 Tue 21:52]
   :CUSTOM_ID: c81be8dd-c3ba-445c-9c35-8faa4bef3ffe
   :END:

Documentation

The Go Gopher

There is no documentation for this package.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL