stackexchange-xml-to-csv
A CLI tool that converts Stack Exchange data dumps from XML to CSV, a format that is easier to import into most databases.
Getting started
Before you begin, make sure you have:
- A working Go environment with Go version >= 1.14. Run go version in the console; it should print the installed compiler version.
- An archiver that can extract .7z files, for example 7z.
Download database dump
Choose and download the database dump that you are going to convert.
Important: the Stack Overflow dump is split across 8 separate 7z archives.
Using 7z or another archiver, extract the contents of the archive(s) into the directory you will convert from.
Example with the academia.stackexchange.com.7z dump:
$ mkdir xml csv
$ 7z e academia.stackexchange.com.7z -oxml
$ ls xml/
Badges.xml Comments.xml PostHistory.xml PostLinks.xml Posts.xml Tags.xml Users.xml Votes.xml
Building stackexchange-xml-to-csv
Clone and build the stackexchange-xml-to-csv converter:
$ git clone https://github.com/SkobelevIgor/stackexchange-xml-to-csv
$ cd stackexchange-xml-to-csv/
$ go build
Converting XML to CSV
You now have the stackexchange-xml-to-csv executable. Convert the XML files:
$ ./stackexchange-xml-to-csv --source-path=../xml --store-to-dir=../csv
List of possible flags:
source-path
(Required) Absolute or relative path to a directory containing XML files, or to a single XML file.
store-to-dir
(Optional) Absolute or relative path to the directory where the resulting CSV files are stored.
skip-html-decoding
(Optional) Some files (e.g., Posts.xml) contain escaped HTML. By default, the converter decodes it. Use this flag to disable that behavior.
RDBMS schema examples
Example schemas for different databases:
License
MIT License