Lecture 01b - Python Installation
Welcome to this discussion on building a scientific computing environment in Python for our big data machine learning and AI course. The goals are to get you up to speed with Git and GitHub, Python, and VS Code, and to launch you into Assignment 1. This assignment will check that you’ve got everything installed and ask which topics you’re most interested in.
Git and Getting Organized
If you haven’t already, create a new account on github.com. Make absolutely sure to choose a good username, because in my experience, this username will stick with you for a very long time. You can associate it with whatever email you want, but I would suggest that you use one that you know for sure you’ll have permanent access to. Even though I’m tenured, I still don’t trust University of Minnesota’s OIT, so I use my Gmail account. Of course, that’s kind of suspect because I also don’t trust Google, but here we are. This will be one of the things that you turn in for Assignment 1.
Next, we need to get the Git software. First, it’s worth noting the difference between two very similarly named things. GitHub is a website owned by Microsoft that hosts code and connects a community of coders to each other’s code bases. Git, however, is version control software that records the history of all the files in a Git repository. Nearly every coder uses Git to push their repository of code up to GitHub. That’s the relationship between the two.
Let’s download Git by going to the link provided. For Windows users, select the Windows option. I’ve already done it on my machine, so you can explore how to do it on your own.
Installing Python
Once you’ve got Git installed, we’re going to install Python. There are many different distributions of Python, and I think I found a way that’s relatively easy. We’re going to use conda-forge. Navigate to conda-forge.org/download and select your operating system. If you have a Mac with an M1 or later chip, make sure that you select the Apple Silicon option, not the older x86-64 Intel architecture.
If you’re on Windows and you get a “Windows protected your PC” message, just click More Info and then Run Anyway.
What you’ve downloaded is Miniforge. I recommend installing it with the default options, just for yourself rather than for all users. I’m also going to recommend a specific install location. If you want to follow along exactly, install it on your C drive (or the Mac equivalent), in your Users directory, under your username (mine would be JA Johns), in a folder called miniforge3. Keep all the defaults, except make sure that you add Miniforge3 to your PATH environment variable. This ensures that it can be called from the command line as needed.
A Quick Word on Distribution Methods
Miniforge is the latest in a series of iterations on Anaconda. Anaconda is a full, large Python distribution with thousands of libraries. It’s very easy to get started with, but it’s heavy and slow and has licensing limits. One of the things it comes with is Conda, which is distinct from Anaconda. Conda is the package manager, and it’s more flexible and cross-language, letting you download your own set of libraries rather than relying on the existing distribution.
However, Conda was quite slow to solve. When you have thousands of different libraries that all need to work together, and each has its own requirements for what else must be installed, finding a set of versions that all work together is a non-trivial tree search problem. Therefore, Conda got pretty slow.
Some clever folks made Mamba, which is a faster re-implementation that uses a slightly different search mechanism. Finally, there is Miniforge. Miniforge is a minimal installer that uses conda-forge. conda-forge is a community of users who curate their own set of libraries, and in particular, their own set of compiled files. One of the cool things about Python is that it can call compiled code, meaning zeros and ones rather than nice and easy-to-read letters. This is good for performance, but it also makes it challenging to manage all the different operating systems and computers that you might have.
You don’t need to worry about this, other than to know that if you use conda-forge, and the package you want is on conda-forge, it will get you the right version. Miniforge is just the Conda and Mamba tools coupled with the conda-forge channel for distributing code, packaged into one installer. That’s what we’ve just installed.
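To make the tree search idea concrete, here is a minimal sketch of a backtracking version solver. The package names, versions, and the INDEX structure are all hypothetical, invented only for illustration, and this is not how Conda or Mamba are actually implemented. It just shows why picking compatible versions means trying branches and backing out of dead ends.

```python
# Hypothetical package index (names and versions are made up).
# Each version maps to its requirements: {dependency: set of accepted versions}.
INDEX = {
    "analysis": {
        "2.0": {"arrays": {"1.26", "2.0"}},
        "1.0": {"arrays": {"1.26"}},
    },
    "plotting": {
        "3.0": {"arrays": {"1.26"}},
    },
    "arrays": {
        "2.0": {},
        "1.26": {},
    },
}


def is_consistent(chosen, index):
    """Check that no chosen package conflicts with another chosen package."""
    for name, version in chosen.items():
        for dep, allowed in index[name][version].items():
            if dep in chosen and chosen[dep] not in allowed:
                return False
    return True


def solve(packages, index, chosen=None):
    """Return one consistent {package: version} assignment, or None."""
    chosen = dict(chosen or {})
    if not packages:
        return chosen
    name, rest = packages[0], packages[1:]
    # Try each candidate version in the order listed (newest first here).
    for version in index[name]:
        candidate = {**chosen, name: version}
        if is_consistent(candidate, index):
            result = solve(rest, index, candidate)
            if result is not None:
                return result
    # Every version of this package led to a conflict: backtrack.
    return None


if __name__ == "__main__":
    print(solve(["analysis", "plotting", "arrays"], INDEX))
```

In this toy index, the newest arrays version is rejected because plotting only accepts 1.26, so the search falls back to the older one. With thousands of real packages, the number of branches to explore grows very quickly, which is the slowness described above.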
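If you’re curious which operating system and chip a compiled package has to match on your own machine, the following sketch prints them using only the Python standard library. The function name describe_platform is my own, chosen for illustration.

```python
import platform
import sysconfig


def describe_platform():
    """Collect the details a compiled package must match on this machine."""
    return {
        "system": platform.system(),            # operating system name
        "machine": platform.machine(),          # CPU architecture
        "build_tag": sysconfig.get_platform(),  # combined platform identifier
    }


if __name__ == "__main__":
    for key, value in describe_platform().items():
        print(f"{key}: {value}")
```

A Windows laptop and an Apple Silicon Mac will report different values here, which is why the same library has to be compiled separately for each.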
Using the Command Prompt
Let’s go ahead and use it. We’ll spend more time in the command prompt coming up, but for now, open up the command prompt. My shortcut way of doing this is to hit the start menu and type CMD.
If you’ve successfully installed Conda and you followed the settings that I recommended, you can just type conda and confirm that it works. You’ll know it works because it gives you a nice long list of conda commands. If for some reason the command isn’t found, type conda init. This just makes sure that your command prompt is aware of the conda tool, but if you followed my steps, you should have it already.
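Another way to check that your tools are reachable is to ask Python where it finds them on your PATH. This is a small sketch using the standard library; the function name find_tools is my own. Note that in some shells conda is set up as a shell function rather than a program, in which case this check can report it as missing even though typing conda works.

```python
import shutil


def find_tools(names=("git", "conda", "mamba")):
    """Map each tool name to its location on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, location in find_tools().items():
        print(f"{name}: {location or 'not found on PATH'}")
```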
You can also see the handy list of different commands here, and you could explore those and learn more about them.
Getting the Course Repository
Before we create our environment, we need to get ourselves the code that we would like to install. For this, we’re going to make sure that we have a proper folder to clone the course repository into. There are lots of ways you can do that. For me, on my C drive, in users, in my username JA Johns, I have a folder called Files. By convention, I always add something called Files for things that I know that I added. For me, I’m going to put the course repository in Teaching. You might want to do that in Learning or something similar.
I’m going to create a folder called APEC8222. If you don’t have one there, you’ll have to create this. When you first start, I will assume it’ll be empty.
In order to navigate there, we need to make the command prompt point to that location. The classic way is to use the cd (change directory) command and go into Files. If you forget what’s there, you can just type dir for directory. It shows you what’s in the current directory, so you might remind yourself that it’s Files/Teaching. You can chain these together too, but I’m just doing one at a time to show you: cd Files, then cd Teaching, then cd APEC8222.
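The same folder layout can be created and inspected from Python, which is sometimes handy later on. Here is a minimal sketch using pathlib. The function names are my own, and the demo builds the folders inside a temporary directory so that it doesn’t touch your real files.

```python
import tempfile
from pathlib import Path


def ensure_class_dir(home, parts=("Files", "Teaching", "APEC8222")):
    """Create home/Files/Teaching/APEC8222 if needed and return its path."""
    target = Path(home).joinpath(*parts)
    target.mkdir(parents=True, exist_ok=True)
    return target


def list_dir(path):
    """Return the names in a directory, like the dir command does."""
    return sorted(entry.name for entry in Path(path).iterdir())


if __name__ == "__main__":
    # Use a throwaway directory as a stand-in for your user folder.
    with tempfile.TemporaryDirectory() as fake_home:
        class_dir = ensure_class_dir(fake_home)
        print(class_dir)
        print(list_dir(fake_home))
```

To build the real thing, you would pass your own user folder (for example Path.home()) instead of the temporary directory.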
Now, let’s suppose that the repository wasn’t there. If I wanted to get it there, that’s where GitHub is going to come in. I’ve already linked on our webpage the link to our GitHub page. You need to copy that URL.
For the assignment, I’m actually going to ask you to make a fork. To make a fork, when you go to my username in this repository, you should see an option under Code to fork it. What that will do is copy this repository into your username, and so you now have total control over it. You can change it or delete it, and it won’t change what I’ve got over here.
Once you do that, navigate into your new repository. You’ll be on a URL that looks like this: github.com/your-username/APEC8222-2025. That is the URL I want you to copy.
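As a quick illustration of the URL pattern, here is a tiny sketch that builds the fork address from a username. The https:// prefix and the function name fork_url are my additions; the username shown is a placeholder, not a real account.

```python
def fork_url(username, repo="APEC8222-2025"):
    """Build the URL of a user's fork of the course repository."""
    username = username.strip()
    # Reject empty names and names containing spaces or tabs.
    if not username or any(ch.isspace() for ch in username):
        raise ValueError("expected a username with no spaces")
    return f"https://github.com/{username}/{repo}"


if __name__ == "__main__":
    print(fork_url("your-username"))
```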
Once you have that copied, make sure that you’re in your APEC8222 folder. Notice it’s a little different: this is the class directory, and into that, we are going to clone the class repository. A little confusing, but it’s necessary. The repository is the one that has the year on it.
We’re going to use the Git that we just installed to do this. First, let’s test that we have Git working. Type git. You should see a list of available options, similar to what Conda gave us. Now that you have the URL to the fork that you just created, type git clone and then paste the URL to your repository. That means that instead of my username, jandrewjohnson, it would have your username here.
When you do this, you will know you’ve done it correctly if you have the APEC8222-2025 folder in your class directory. Poking around, that’s where we’ve now got all these files that were on the GitHub page, but now we used Git to put them on your local machine rather than the remote server.
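The clone step and the check that it worked can also be sketched in Python. The function names below are my own, and the check simply assumes that a cloned repository contains a .git folder, which is how Git normally stores its history. Nothing runs git unless you call clone yourself.

```python
import subprocess
from pathlib import Path


def clone_command(repo_url):
    """Return the git command, as a list of arguments, that clones repo_url."""
    return ["git", "clone", repo_url]


def clone(repo_url, class_dir):
    """Run git clone inside class_dir (requires Git to be installed)."""
    subprocess.run(clone_command(repo_url), cwd=class_dir, check=True)


def looks_cloned(class_dir, repo_name="APEC8222-2025"):
    """Return True if class_dir contains the repository with a .git folder."""
    return (Path(class_dir) / repo_name / ".git").is_dir()


if __name__ == "__main__":
    print(" ".join(clone_command("https://github.com/your-username/APEC8222-2025")))
```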
Creating the Environment
You changed directories into there, then you Git cloned it. Now we’re going to create a new environment. You could create your own environment, but we’re actually going to use a pre-built definition of the environment that I’ve already built for you. The details are stored in the environment.yaml file.
In order to do this, first you need to change into the repository’s root directory with cd APEC8222-2025. That’s an easy one to skip, because when we were in the class directory and cloned the class repo, it didn’t change us into the class repo. Once you run that cd command, you’re in the repo.
If we look, we can see there’s a .gitignore file, so we know we’re in the right place. Now that we’re there, we can use this command to install a whole bunch of software: mamba env create -f environment.yaml -n your-environment-name. The -f option says use whatever set of Python packages are defined in this file, and the -n is indicating that you want to give it a custom name, not just the name that I’ve already given it. You can replace “your-environment-name” with whatever you want to name it. I recommend calling it ENV8222A for now.
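To spell out how the pieces of that command fit together, here is a small sketch that assembles it and checks that the environment file is present before you run it. The function names are my own, and nothing here actually creates an environment.

```python
from pathlib import Path


def env_create_command(env_name, env_file="environment.yaml", tool="mamba"):
    """Assemble the environment-creation command as a list of arguments."""
    # -f points at the file listing the packages; -n sets the environment name.
    return [tool, "env", "create", "-f", env_file, "-n", env_name]


def env_file_present(repo_dir, env_file="environment.yaml"):
    """Return True if the environment file exists in repo_dir."""
    return (Path(repo_dir) / env_file).is_file()


if __name__ == "__main__":
    print(" ".join(env_create_command("ENV8222A")))
    print("environment.yaml found here:", env_file_present("."))
```

If the second line prints False, you’re probably still in the class directory rather than inside the repository.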
One thing to note is the word Mamba here. You notice for everything else, we’ve been using the word Conda. As I mentioned earlier, Mamba is the super-fast approach to installing all these things in the environment file, so make sure to use it here. Usually, you can use Mamba and Conda interchangeably, but this is the one where it really matters. But I recommend switching back to Conda just for cross-compatibility for everything where you don’t actually need the speed.
In general, you don’t need to be in any particular directory to create a new environment, but in this case we do need to be in the repository, because that’s the only place where the environment.yaml file is stored. If you want to see what it looks like, open the file. What we have here is a whole bunch of libraries that we’re going to install, along with the specific versions that work well with each other. All of these have thousands and thousands of lines of code each, so it’s hard to appreciate the full extent of awesomeness that you get when you install a large Conda environment.
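For a rough idea of the shape of such a file, here is a sketch with a made-up example and a very simple reader for it. The contents of EXAMPLE are hypothetical and are not the course’s actual environment.yaml; the reader only handles plain "name=version" lines and would not cope with everything a real environment file can contain.

```python
# Hypothetical environment file contents, for illustration only.
EXAMPLE = """\
name: example-env
channels:
  - conda-forge
dependencies:
  - python=3.13
  - numpy=2.1
  - pandas
"""


def parse_dependencies(text):
    """Return {package: version or None} from the dependencies section."""
    deps = {}
    in_deps = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if not line[0].isspace():
            # An unindented line starts a new top-level section.
            in_deps = stripped == "dependencies:"
            continue
        if in_deps and stripped.startswith("- "):
            name, _, version = stripped[2:].strip().partition("=")
            deps[name] = version or None
    return deps


if __name__ == "__main__":
    print(parse_dependencies(EXAMPLE))
```

A package listed without a version, like pandas in this example, leaves the solver free to pick whichever version is compatible with the rest.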
When you run this command, it’s going to install everything in that environment file. This command might take up to 10 minutes to run for you, because it is downloading gigabytes of code. Be very patient with it and don’t close the terminal early.
Once it’s done, we need to activate that environment. Type conda activate ENV8222A (or whatever you named it). You’ll know it worked because now we see the name of the environment in parentheses in front of our path. Now, whenever we run something, like run Python itself, it’ll be using this distribution of Python.
Just for confirmation, let’s type python. Now we have a very new version, 3.13, packaged very freshly for us by the wonderful folks at conda-forge and Miniforge. Here’s where we’d be able to run Python. It’s a really good calculator, but you’ll probably want to get out of it. These three greater-than signs (>>>) indicate that you’re in a Python prompt, and you’ll want to go back to the command prompt. For that, the built-in Python way of doing it is quit() with parentheses, and now we’re back out. We’re still in our environment, but we’re back to where we were before.
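If you want to double-check which Python you’re actually running, this short sketch reports it from inside Python. The function name describe_python is my own. It relies on the CONDA_DEFAULT_ENV environment variable, which Conda sets when an environment is activated; if nothing is activated, that entry will be None.

```python
import os
import sys


def describe_python():
    """Report which interpreter is running and which Conda env is active."""
    return {
        "executable": sys.executable,
        "version": f"{sys.version_info.major}.{sys.version_info.minor}",
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
```

After activating the course environment, the executable path should point inside that environment’s folder, and conda_env should show the name you gave it.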
Installing VS Code
The last tool for today is to install VS Code. Go to the download page and select the User Installer 64-bit option for Windows. The default options are pretty much okay as is, but I have one that I use all the time called “Add Open with Code Action to the Windows Explorer file context menu.” What that does is it just gives you a handy way to open up a folder in VS Code. If you have that option, when you right-click on a folder, it gives you this “Open with Code” option, and we’ll see that’s pretty handy later on.
Once you’ve got it downloaded and installed, go ahead and launch the IDE however you like. IDE stands for Integrated Development Environment. It’s a super useful, super powerful text editor that also allows you to do lots of extra coding things, like run the code, inspect the code, or easily navigate among different parts of the code. You’re probably familiar with an IDE from R and RStudio. That’s an IDE too, but it’s specific to R.
VS Code is the most used IDE across all of the languages, and this now includes remarkable growth among R users themselves, who are switching from the RStudio world to the VS Code world. There are tons of advantages and a couple of disadvantages to doing this. It’s got complexity that you don’t necessarily need to learn if you’re just going to be running regressions. But our goal for this course is to go beyond just that.
You can do all sorts of things, like change color schemes on that launch page. I used to like light color schemes, but I’ve since changed to dark color schemes.
The critical thing I do want you to do is link this VS Code application with your GitHub account. We’re going to use it to do a lot of our Git work for us, and we can actually do the git clone command directly in VS Code, which can be a lot more user-friendly. It’s very important to understand the command line because some things are only implemented there. But for the rest of the course, we’re probably going to do most of our Git in the VS Code environment.
To make that work, you need to enable Settings Sync. Looking at VS Code, if you go to the Accounts section at the bottom, I’ve already signed in, but if you haven’t, that’s where it’ll give you an option to enable Settings Sync. There are other ways to bring this up too, but one way or the other, you’ll select GitHub. It’ll prompt you to sign in. Make sure you’re selecting “Sign in with GitHub.” That will pop up a browser window that asks you to authorize this application, in other words to let VS Code access your GitHub account, and you’ll say yes. That means that your VS Code is now connected to your GitHub account.
Just previewing where we might go, one of the things that means is now you’ll have this tab called Source Control. It was there before, but it didn’t point to any repository that it could find. Now you’ll be able to do things like clone code and push code from here directly. That’s pretty handy, and we’ll say a bunch more about that.
The last thing is I would like you to install the VS Code Python extension. For that, click on the Extensions button and just search “Python.” It might feel a little odd that VS Code doesn’t have Python built in. That’s because it works for any language, and so you actually need to configure the different ones. But there are super well-supported extensions for the obvious things like Python. This is the official one from Microsoft, and there are tons of other awesome tools and stuff you can install there. We’ll learn about a few more later on.
Assignment 1
Once you’ve got all this, you are now ready to take a look at Assignment 1. Navigate to the website and find the assignment. There are two basic parts. Number one is essentially what I just walked through: install your environment for our class repository by following the instructions in the 01A slides.
Then move on to part two, which is deciding on the topics for this class to cover. We talked about this in class, and I’ve now updated the website. On the homepage, you’ll see the topics that we talked about before, and then we have the additional topics. A lot of you are really interested in causal machine learning. I’m really interested in computer vision and remote sensing for economics, but there are a whole bunch of possible topics.
All of these have really good papers as well as a good Python implementation. Do look through those, and as the assignment asks you to do, write one to two paragraphs and submit them in Canvas describing which ones you are most interested in. I’ll also offer you the option: if there’s anything you’re not interested in and you feel really strongly about it, you could go ahead and put that too. You’re also welcome to suggest additional topics that are not already listed, but they need to have an implementation available on GitHub.
After you’ve written up your paragraphs, also go ahead and copy and paste the URL of your GitHub fork of the class repository. This can be private; I just need this so I know what your username is and that you actually got things installed in the right place. Personally, I recommend making it public, and that’s because a lot of people will judge your coding skills by how long-lived and detailed your GitHub profile is. Some people ask me once in a while why I don’t have a LinkedIn page, and my answer is that for anything I really want to promote myself on, they’d be better off going to my GitHub page and seeing what I’m really doing rather than a polished version.
You’ll simply go to Canvas and submit that there. Feel free to email me before next class if you have any questions, but ideally you have all of this submitted before class starts on Thursday. Thank you, everybody.