Setting up the Coding Environment
1 Poetry
We will work with Poetry to manage the dependencies of the project. Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on, and it will install and update them for you.
Mac users can install Poetry with Homebrew by running the following command in the Terminal app:
brew install poetry
For Windows users, you can install Poetry by running the following command in the command prompt:
pip install poetry
By default, Poetry creates the virtual environment for each project in the ~/Library/Caches/pypoetry/virtualenvs/ directory (the macOS location). I prefer to keep it in the project directory. That way, if I delete the project directory, the environment is deleted as well, which prevents virtual environments from accumulating for discarded projects. To enable this, run the following command:
poetry config virtualenvs.in-project true
To create a poetry environment, first clone the repository and navigate to the root directory of the project. Then run the following command:
cd ~/codes/ # where codes is the directory where you store the cloned repositories
poetry init
Once the project is created, you can add the dependencies:
poetry add pandas numpy scipy matplotlib seaborn statsmodels scikit-learn linearmodels pyarrow jupyter pytest hydra-core
You can always add more dependencies later by running the same command with the additional dependencies.
This step updates the pyproject.toml file and creates a poetry.lock file, which contains the exact version of each dependency. This file is used to make sure that all collaborators use the same version of each library. Note that because our dependencies are built on top of other libraries, Poetry will also install the dependencies of our dependencies.
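To activate the environment in the terminal, run the following command (in Poetry 2.0 and later, the shell command is provided by the poetry-plugin-shell plugin rather than by Poetry itself):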
poetry shell
Usually, after you open your GitHub repo in VS Code, the terminal prompt will show the name of the Python environment that Poetry created. In that case, you won't need to run poetry shell.
If the poetry.lock file is already present when you clone the repo, then after installing Poetry, simply run:
poetry install
2 .env file
We will use a .env file to store environment variables, such as API keys and other sensitive or user-specific information like the root of the data directory. We will store the data on a server (and maybe some on Dropbox), and the path to the data root is different for each user. The .env file should be in the root directory of the project and should not be committed to the repository. To prevent this, we will add .env to the .gitignore file. The .env file should look like this:
API_KEY=your_api_key
DATA_DIR=/path/to/data
FIG_DIR=./results/figures/
TBL_DIR=./results/tables/
TMP_DIR=./tmp/
To read the environment variables, we will use the python-dotenv library. To install it, run the following command in the terminal:
poetry add python-dotenv
The following code will be added to the main.py file so that the environment variables are loaded when the script is run:
from dotenv import load_dotenv
from pathlib import Path
import os

# Load the variables defined in .env into the process environment
load_dotenv()

api_key = os.getenv("API_KEY")

# The data root is user-specific, so fail early if it is missing
datadir_path = os.getenv("DATA_DIR")
if not datadir_path:
    raise ValueError("DATA_DIR environment variable not set")

data_dir = Path(datadir_path)
download_dir = data_dir / "download_cache/"
open_dir = data_dir / "open/"
restricted_dir = data_dir / "restricted/"
clean_dir = data_dir / "clean/"

# Output locations fall back to project-relative defaults
tmp_dir = Path(os.getenv("TMP_DIR", "tmp/"))
fig_dir = Path(os.getenv("FIG_DIR", "./results/figures/"))
tab_dir = Path(os.getenv("TBL_DIR", "./results/tables/"))
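The check on the data directory variable above can be generalized to any variable that must be set. The following is a minimal sketch using only the standard library; the helper name require_env_path and the EXAMPLE_* variable names are made up for illustration and are not part of the project.

```python
import os
from pathlib import Path


def require_env_path(name: str) -> Path:
    """Return the environment variable `name` as a Path, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise ValueError(f"{name} environment variable not set")
    return Path(value)


# Demo with made-up variables so the real environment is left alone.
os.environ["EXAMPLE_DATA_DIR"] = "/path/to/data"
os.environ.pop("EXAMPLE_MISSING_DIR", None)

example_dir = require_env_path("EXAMPLE_DATA_DIR")
print(example_dir / "clean")

try:
    require_env_path("EXAMPLE_MISSING_DIR")
except ValueError as err:
    message = str(err)
    print(message)
```

Failing at startup with the variable name in the message is usually easier to debug than a path error raised later, deep inside a download or cleaning step.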
3 Hydra
We will use Hydra to manage the configuration of the project. Hydra is a framework for configuring complex applications: you define a configuration file with default values and then override those values with command-line arguments. This is useful when you want to run the same script with different parameters. To install Hydra (if it was not already added with the dependencies above), run the following command in the terminal:
poetry add hydra-core
To use Hydra, you need to create a configuration file. It should be a YAML file in the conf directory (here, conf/config.yaml) and should look like this:
matplotlib:
  font:
    family: serif
    sans_serif:
      - Helvetica
    serif:
      - "Computer Modern Roman"
download_data:
  crsp: false
  compustat: false
process_raw_data:
  ibes_sue: false
database:
  # panel
  build_panel_db: false
  save_panel_db: false
  load_panel_db: true
tasks:
  # ---- Figures ----
  long_short_cumul_ret_fig: false
To read the configuration file, the following example code will be added to the main.py file so that the configuration is loaded when the script is run:
import logging

import hydra
from omegaconf import DictConfig, OmegaConf

# configure_pyplot, download_files, process_raw_files and create_panel_dataset
# are assumed to be imported from the project's own modules; open_dir,
# restricted_dir and clean_dir are the paths built from the .env file.


@hydra.main(version_base=None, config_path="./conf", config_name="config")
def my_app(cfg: DictConfig):
    configure_pyplot(
        font_family=cfg.matplotlib.font.family,
        font_serif=cfg.matplotlib.font.serif,
        font_sans_serif=cfg.matplotlib.font.sans_serif,
    )
    download_files(
        crsp=cfg.download_data.crsp,
        compustat=cfg.download_data.compustat,
    )
    # Process raw files
    process_raw_files(
        ibes_sue=cfg.process_raw_data.ibes_sue,
    )
    if cfg.database.build_panel_db:
        logging.info("Build panel daily database")
        panel_db = create_panel_dataset(
            open_dir=open_dir,
            restricted_dir=restricted_dir,
            start_date="2012-01-01",
            end_date="2022-12-31",
        )
        # Nested here because panel_db only exists after a build
        if cfg.database.save_panel_db:
            logging.info("Saving panel daily database")
            panel_db.to_parquet(clean_dir / "panel_db.parquet")


if __name__ == "__main__":
    my_app()
4 Code directory
The code directory in Git should be structured as follows (with .py examples):
main_code/
    conf/
        config.yaml
    database/
        __init__.py
        download_data.py
        process_raw_data.py
        build_panel_db.py
    figure_codes/
        __init__.py
    table_codes/
        __init__.py
        regression_table_1.py
        regression_table_2.py
    utils/
        __init__.py
        common_functions.py
    tests/
        __init__.py
        test_download_data.py
        test_process_raw_data.py
        test_build_panel_db.py
5 Data directory
The data directory in our shared drive should be structured as follows:
data/
    download_cache/
    open/
    restricted/
    clean/
The download_cache directory is used to store the raw data files downloaded from the internet. The open directory is used to store the raw data files that are open to the public. The restricted directory is used to store the raw data files that are restricted. The clean directory is used to store the cleaned data files.
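A new collaborator may not have these subdirectories yet. The following is a minimal sketch, assuming it is acceptable to create them on first run; the helper name ensure_data_dirs is hypothetical, and the demo writes to a temporary directory rather than to the real data root.

```python
import tempfile
from pathlib import Path

# The four subdirectories described above.
SUBDIRS = ("download_cache", "open", "restricted", "clean")


def ensure_data_dirs(data_root: Path) -> dict[str, Path]:
    """Create the data subdirectories if they are missing and return their paths."""
    paths = {name: data_root / name for name in SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Demo on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_data_dirs(Path(tmp) / "data")
    created = sorted(p.name for p in (Path(tmp) / "data").iterdir())
    print(created)
```

Because exist_ok=True is set, calling the helper again on an existing layout does nothing.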
We will not store raw or processed data in Git; we will use our shared Dropbox folder instead.
6 Saving data
We should save our data in the Parquet format.
7 Naming convention
For default variables, like model parameters, we use capital letters, e.g., N_LOOPS = 40 for the default number of loops. For others, we use lowercase, e.g., n_loops = 40 for the number of loops in a specific case. We will store default variables in a separate file called parameters.py in the utils directory.
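A short sketch of how the two conventions fit together. In the project, N_LOOPS would live in utils/parameters.py and be imported where needed; here it is defined inline so the example runs on its own, and run_simulation is a hypothetical function used only for illustration.

```python
# Default value, written in capitals (would live in utils/parameters.py).
N_LOOPS = 40


def run_simulation(n_loops: int = N_LOOPS) -> int:
    """Hypothetical function: run n_loops iterations and return how many ran."""
    count = 0
    for _ in range(n_loops):
        count += 1
    return count


# Using the project-wide default.
default_runs = run_simulation()

# Using a case-specific value, written in lowercase.
n_loops = 10
custom_runs = run_simulation(n_loops=n_loops)

print(default_runs, custom_runs)
```

Keeping defaults in one file means a change to N_LOOPS propagates to every function that uses it as a default argument.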
8 File formatting on saving
We will use the Black formatter to format our code. In VS Code, install the Black Formatter extension. To format the code automatically on save, go to the settings, search for format on save, and check the box.