Sofia Traffic Crawler

Goal

The main goal of the crawler is to crawl only the Sofia Traffic websites (http://schedules.sofiatraffic.bg) and, at first, to use it as a tool for getting the current infrastructure of Sofia's public transportation by crawling all possible stations, lines, schedules, etc. It also tries to match the data between the sites listed above, which differs a lot. In the future the tool should be able to detect the frequent changes to that structure. Another goal of the crawler is to poll the times of each active stop included in the infrastructure for a given operation mode.

That poll can run every couple of seconds at best. The idea of the poll is to proxy the values it yields via the API and later on to use them for data-learning purposes.

Sources and types of extracted data

The sources for the data are different: for the predefined lines, directions, operation modes, stops and schedules we can use schedules.sofiatraffic.bg and some of its 'hidden' services.
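
For illustration, here is a minimal sketch of fetching the schedules page with Go's standard library. Only the URL comes from this README; the error handling and what is done with the body are assumptions:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // Fetch the public schedules page; the 'hidden' services behind it
        // would be crawled the same way, just with different URLs.
        resp, err := http.Get("http://schedules.sofiatraffic.bg")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            panic(err)
        }
        fmt.Printf("fetched %d bytes of HTML to parse\n", len(body))
    }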

Some noted problems

Those polled times can disappear, as if the vehicle has arrived, and then reappear. Another problem is that they don't always disappear after that time has passed or the vehicle has arrived, and a time can stay visible for 15 minutes (possibly more) after it has passed.
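
A consumer of the polled times therefore has to smooth over this flakiness. Below is a minimal sketch of one possible mitigation, not part of this package; all names and the grace period are illustrative assumptions. A time is treated as gone only after it has been absent from the polls for longer than a grace period:

    package traffictimes

    import "time"

    // flickerFilter smooths over arrival times that briefly disappear and
    // reappear between polls.
    type flickerFilter struct {
        lastSeen map[string]time.Time // arrival time -> last poll that contained it
        grace    time.Duration        // absence shorter than this is ignored
    }

    func newFlickerFilter(grace time.Duration) *flickerFilter {
        return &flickerFilter{lastSeen: make(map[string]time.Time), grace: grace}
    }

    // observe records the times returned by the current poll and reports
    // which previously seen times have been absent long enough to be
    // considered truly gone.
    func (f *flickerFilter) observe(now time.Time, polled []string) (gone []string) {
        for _, t := range polled {
            f.lastSeen[t] = now
        }
        for t, seen := range f.lastSeen {
            if now.Sub(seen) > f.grace {
                gone = append(gone, t)
                delete(f.lastSeen, t)
            }
        }
        return gone
    }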

A further problem is the silent addition of lines whose schedules serve a single day only.

Another problem is that some lines have as many as 8 different routes that they serve.

Some lines do not operate during the weekend, others during holidays, etc. Some lines are marked for removal or for an update of some of their stops, but the notes are in Bulgarian and are practically un-parsable without NLP.

Usage

In order to use the crawler you need Redis, and you have to pass a Redis connection pool to the crawler.
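
For example, here is a minimal sketch of building such a pool with the redigo client (github.com/gomodule/redigo/redis). The choice of client, the address and the pool settings are assumptions, since this README does not name a specific library:

    package main

    import (
        "time"

        "github.com/gomodule/redigo/redis"
    )

    // newPool builds the Redis connection pool that is handed to the crawler.
    func newPool(addr string) *redis.Pool {
        return &redis.Pool{
            MaxIdle:     3,
            IdleTimeout: 240 * time.Second,
            // Dial opens a fresh connection whenever the pool needs one.
            Dial: func() (redis.Conn, error) {
                return redis.Dial("tcp", addr)
            },
        }
    }

    func main() {
        pool := newPool("localhost:6379")
        defer pool.Close()
        // pool is now ready to be passed to the crawler (see the next sketch)
    }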

After that you can start 4 types of crawls. They have some dependencies between them, but their results are cacheable in Redis, so you don't have to run them each time (a combined example is sketched after the list below).

  • CrawlLines will extract all the lines, directions, operation modes and stops information, along with some internal IDs.
  • CrawlSchedules will use the information from the previous call and get all the schedules for each stop in the traffic network for every line, operation mode and direction.
  • CrawlVirtualTablesLines will use the information from CrawlLines in order to extract the IDs for the stops there.
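
Putting it together, here is a sketch of running the crawls in order. The constructor name, the import path and the error-returning signatures are assumptions; check the package source for the exact API:

    package main

    import (
        "log"

        crawler "example.com/sofiatraffic-crawler" // hypothetical import path
    )

    func main() {
        pool := newPool("localhost:6379") // newPool: see the previous sketch
        defer pool.Close()

        // Hypothetical constructor taking the Redis pool, so that crawl
        // results can be cached and reused between runs.
        c := crawler.New(pool)

        // CrawlSchedules and CrawlVirtualTablesLines both depend on the
        // internal IDs that CrawlLines extracts, so it must run first.
        if err := c.CrawlLines(); err != nil {
            log.Fatal(err)
        }
        if err := c.CrawlSchedules(); err != nil {
            log.Fatal(err)
        }
        if err := c.CrawlVirtualTablesLines(); err != nil {
            log.Fatal(err)
        }
    }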
