⛏️ Data collection #
Data mining #
This release of the project already includes GitHub’s mined resource files, and logs of the events that took
place during the mining process, in src/main/resources/<lang-name>/raw
, where <lang-name>
is one of the 3 target programming languages: java
, kotlin
and python3
.
In order to repeat the mining process for any of the languages, one must obtain a GitHub API Token, and insert
it into the file src/main/python/github_miner.py
as indicated (API_TOKEN
field), and hence run the
script from within its directory.
Preprocessing #
Preprocessing is the task of obtaining an oracle of 20k files with correct language derivations, for each one of the three languages.
Most of the logic is contained in src/main/kotlin/preprocessor
, whereas the formal grammatical
syntax highlighters are contained in src/main/kotlin/highlighter
.
The process of oracle generation is divided into two step: generation and cleaning. These are run for each language, hence, for example, with regard to Java, one would run the following two Gradle tasks:
./gradlew JavaPreprocessor -Pargs="generateOracle"
which filters and converts GitHub’s raw files sequentially into valid ETA
and HETA
representations;
./gradlew JavaPreprocessor -Pargs="cleanOracle"
which through skipping token-wise duplicate patterns, takes the first 20k samples in the dataset.
This can be performed for Kotlin and Python datasets by changing the target Gradle task to KotlinPreprocessor
and
Python3Preprocessor
respectively.
Note that this project already provides oracles for all languages, which are stored in
src/main/resources/<lang-name>/oracle