Lecture 01 - Introduction

Welcome to our first lecture of the big data course. I think we still have a good number of people coming; there may have been some challenges with the weather and the traffic, but that's okay, because we're going to start with a round of introductions, and hopefully they will filter in. Before we dive into those, there's a bit of context to this class. A good number of you just finished taking my 8602, Applied Earth Economy Modeling class.

We've been using a lot of these tools and approaches already, so those students will come in with a slightly different basis, but we're going to assume that nobody has that basis. They just might be extra fast at it, so if you hear questions that don't quite make sense, that's the context. Also, many of you took 8221. Can you all raise your hands if you took 8221?

Okay, so everybody, good. That’s the other bit of context that I don’t have, because I wasn’t there. I’m sure Allie did a great job teaching you about R, and so this course, 8222, is going to build directly from there, but it’ll also touch on maybe some content in 8602.

Also, a quick note to those who did attend my 8602 class: apologies, there’s going to be a couple of things that we will redo, such as the first assignment, getting you up to speed on Git. That will be very easy for you, but we’ll very quickly get into new material.

I'll introduce myself first, but then I want to go around the room and have you all introduce yourselves with your name, what program you're in, how far along you are in that program, which of the courses that I just mentioned you took, and your favorite programming thing you learned, either in those courses or in general if you didn't take them.

So starting with me: Hi, I'm Justin Johnson. I'm an associate professor here in Applied Economics. I just got tenure at the beginning of this semester, so I'm feeling pretty happy being here. My main interests are in environmental economics, but really I'm switching to global sustainability and earth economy modeling, so you'll hear me use that phrase a lot, earth economy modeling, even though it's not directly relevant to this course. The research that I do looks at Earth and economy modeling together, but much of that requires machine learning, AI, and a bunch of other stuff that we'll learn here. One last context note: in subsequent years, I really want to have this class come before 8602, so there's not this sort of weird sequencing. But I think we'll make it through, so feel free to ask me any questions.

For the rest of today, what I want to do is just quickly talk about the syllabus. Then talk about the general question of what is big data and what does it offer to economists, but then very quickly transition into a pair of very tightly related questions: What is the relevance and extension to machine learning? What is the extension to artificial intelligence?

But then what I want to do is spend a good amount of time walking through the schedule that I had before, which apparently I made a lot of duplication mistakes on, because I want to co-create it here together. We'll talk through what those concepts are. I'll walk us through them, covering the things we learned the last time I taught this class two years ago, but I've also spent a huge amount of time collecting a wide variety of additional topics and sources that I want to talk through. Part of Assignment One, which is actually due on Thursday, is, in addition to getting your GitHub set up, which is really easy, to look through all of the topics, both the existing ones from the last time I taught this and the new ones, and write a paragraph on which ones you personally are most interested in. I want to do this because we're going to craft the schedule together. The reason is that AI, machine learning, big data, all of these are moving so fast that the stuff I taught in 2023 is woefully out of date. Well, not entirely: some of it is foundational, won't change, and is still relevant and useful for economists. But what's fun about this is being able to learn the latest and greatest, and so I will try to facilitate a discussion on that.

I probably won’t have time to talk about the research topics that I have. I’m going to move that to the end, and maybe as something we skip for this year, because I want to spend a little bit more time answering up front: How does this area connect to econometrics? Being applied economics, we obviously have a lot of focus on econometrics, and even those who aren’t in our department, I’m guessing you also have some interest in it. We’re going to talk about what’s different here. We’ll spend a lot of time on that.

A few notes just on the syllabus to get us started. First off, I didn't print it out, and that's because this course leverages the course website. Canvas is a pretty thin shell that just points to the course website, as you've probably discovered. The one thing we will use it for is keeping track of grades, so you actually do the submissions there. I think I'm required to do that, and I don't want to get in too much trouble.

But regardless, most of the day-to-day stuff will happen on the course website, and I update that very, very frequently. Let’s actually just take a look at it.

Maybe another thing to note is that I'm also juggling in-class participation with a good number of people online, so hey, everybody online. A lot of people want to take in this information, not necessarily for credit, but because it also tightly relates to the research aims of the research center that we have here in Applied Economics called NATCAP Teams, the Earth Economy modelers. They're also learning things that might come from this course. But that means I'm going to be constantly switching around screens, and apologies in advance that I'm not very good at remembering which screen I'm sharing.

Okay, so now I think I've got the right one, and hopefully everybody here has already seen the course website. This is the one linked from Canvas. A few things to note: we've got the GitHub repo, and we'll learn a lot about Git, even more than in the 8602 class. The syllabus lives here.

This is the contract, the statement of what you got yourself into. That's what a syllabus is. But you'll note that the schedule there is just approximate, and it's the one that has the duplications, so no, there aren't actually two different November 4ths. That's just in the syllabus.

What is more up-to-date is this existing schedule that we're going to talk about. The course website also has a page for each class session, linked here. Readings, when they become assigned, will be linked there. The slides: we actually have two different slide decks listed today. Slides 01 are what we're going to talk through today, and 01A, which I probably should have called 01B, is a set of slides that I'll give only via a video that will be up on YouTube pretty soon. That one is for you to do at home to get Python installed on your computer, so we don't have to sit here and wait for everybody to download a bunch of stuff.

Assignments: I'm going to update this. I thought I had already, but okay. I was editing it right before class, so after class I'll send out the final version. It's going to be these steps plus that extra bit I talked about: a paragraph describing which of the additional topics you are most interested in, so we can create the final schedule together.

Any questions on the website?

[Question about Python installation]

So the question was, do you need to install Python again? You don't need to install Python per se, but you will need to manage separate environments. Using Conda, the Anaconda distribution, to manage libraries, we're going to have a separate environment for this course. The assignment walks through that. That's because machine learning uses different libraries than the spatial analysis we did there.

Any other questions?

[Question about repositories]

Should you make it private and invite me as a collaborator, or make it public? That's up to you. I personally make everything I do public, but I know a lot of people don't like to do that, and that's fine. If you want to keep it private, just invite me to it; that's the easiest way. Actually, for this first assignment, I don't even really need the invite. You will be creating a final repository for this course, and that's the one I'll need eventually. But for now, just send me the GitHub URL. Even if I can't access it, that gets me what I really need, which is your GitHub username. So that's why we're doing it this way. Sound good?

[Discussion about Git usage in previous courses]

Oh, okay. I didn't know that she did Git; she didn't do Git in the previous year. So she spent a fair amount of time on Git? Cool, okay. Then this will be easier this year than most, so that's good to hear.

Any other questions? This is a mishmash of old and new information, and I didn't clarify it, so I'm going to clean it up and send out an update when that's done, and then we'll update it again with our decisions about which topics and lectures should be included.

Okay. But without further ado, then, let’s switch over to a little bit more discussion about what we did have in the syllabus.

[Technical setup discussion]

Okay. So, let’s talk about this.

Success in this field increasingly requires mastering software and code. Applied economics itself is shifting in this direction, and in many ways, we’re ahead of a lot of other fields, especially more theoretical econ departments. We’re applied, and that sort of naturally lends itself to code, and so we tend to be data analysts. This shift is one that I think we’re doing quite well at. I think this is a really positive way that we are distinguished from more standard, theoretical, unapplied econ programs.

My friends over on the West Bank in the standard econ program really don’t like it when I call us applied econ and them unapplied econ. They prefer just to be econ, but all in good fun. There is a big difference in emphasis here, very much focused on the application to data, and that’s what we’re going to do.

In this course, we will primarily use Python, and that's what distinguishes it from the other courses you may have taken, and from the common language used throughout most applied economics coursework, and econometrics and statistics in general: R. The programming language R is dominant in econometric analysis, which is why it's the basis of our department's core coding approach. We have an R boot camp, and then we have Allie's R course. R is a wonderful language, and if you just want to do econometrics, I would completely recommend it over Python.

That may change at some point, but R is a purpose-built language that is really good at statistical analysis, so that's not too surprising. It's basically faster to learn, I would say, and simpler, if all you're doing is statistical analysis.

However, when you want to do something else, or you want to work on really big data, or you want to do machine learning in general, these things can't be done as easily in R. There's always a possibility that somebody has taken the Python code and converted it to R, but that's a secondary process. The fact of the matter is that other disciplines, very particularly machine learning, generative AI, and the billions and billions of dollars recently invested in large language models, are 100% focused on Python. Sometimes a person will port a library from the machine learning Python ecosystem over to R, but those ports are not very common and quickly become out of date or unsupported. Essentially, if you want to learn machine learning, you have to learn Python. There's no way around that.

But what I would say in general is that in your career, I don't think Python will be the only language you learn. Really, the goal is to become bilingual, or better yet, multilingual in programming languages. Once you learn one language, learning the second is much, much faster. This is even more true of programming languages than of actual spoken languages: if you learn one spoken language, like Chinese, and then try to learn Spanish, there's not a whole lot of acceleration there. But if you learn R and then Python, you'll probably learn Python twice as fast. If you learn Julia after Python, you'll probably learn Julia four or five times as fast, so we have a nice scaling law.

The reason for this is that the syntax differences are pretty small; the fundamental way of thinking about coding is the hard part. For the first few days, it probably feels like the syntax is all that matters, and at that stage it is the hard part, but you quickly get past it, and then the question is how good you are at the concepts. And those are the same across all languages.

Fortunately, we will not assume that anybody here has any Python experience, although there actually is a fair amount in the room. We're going to walk through it from the very beginning, and that's what will be in the virtual lecture, the YouTube video I'll link, which I'll record this afternoon. We'll walk through the installation of Python, and then in the first few lectures we'll cover things like language basics and aim toward applying it ever more to machine learning models.

So the readings: everything is freely available. We are basing the course on a couple of books. Number one is The Elements of Statistical Learning. This might seem odd, because it's such an old book, 2009, but it's the one where we'll get some of the core elements from the point where statistics started to split, where econometrics went one direction and machine learning went another. This book does a great job of spanning both, and it's available there. But then we'll also have more updated material. This one really should say 2025, because it's a living website for the book: Arthur Turrell is an economist, and he writes Coding for Economists, available on his website. That's one of many possible books you could choose to get into Python, but I like it because it's very much from the perspective of an economist, and so hopefully it's a little quicker to pick up for somebody coming from that econometrics and R background.

There’s tons of other books, and actually most of the readings that we’ll really have will not be from these textbooks. It’ll be from the latest and greatest articles that have recently come out in machine learning, where we’ll try to replicate different things. But that’s what we got.

Office hours: whenever. I have an open-door policy, so just look at my UMN Google Calendar and see if there’s a good slot for you. I’m happy to meet.

Evaluation: this class will be 70% assignments and 30% class participation. That's actually a heavier-than-average emphasis on class participation, and that's because we're going to do this together. I'll very frequently be circulating around class, looking at everybody's screen, and if you have all the stuff installed and you're working along at the typical pace, you'll get all your class participation points. We'll work on things together a number of times, and if you struggle with something, that's fine; just let me know as soon as possible, because the goal is to have everybody on the same page. Once we all get out of sync with each other, it's nearly impossible to teach coding. So that's what class participation covers.

Yeah, I was supposed to download the UMN-approved AI syllabus language. This is not it. We’re literally going to be running and using ChatGPT and learning some of the basics of transformers, so not too surprisingly, I’m fine with any and all usage of large language models.

The other thing I’d add, and this is actually in the syllabus, is I don’t even require attribution, because that’s like saying, I wrote this on a Dell, or I wrote this on an Apple. This is just core technology, it doesn’t matter. However, flaws are still your own. If large language models make you say something stupid, I will judge you the same whether you wrote it yourself or the large language model wrote it for you. That’s a little different in emphasis from how most people try to approach AI. I think many people try to prevent it or consider it cheating, which I think is absolutely ridiculous. It’s not going away, even if you wanted it to. Much more, it’s important to master it.

I’ll say a lot more about that, including a robust discussion in a later lecture about the pros and cons of AI, specifically from the perspective of somebody trying to learn to code. The preview of that is, if you’re already good at coding, AI is all win-win-win for you. However, if you don’t know coding yet, it’s a real pitfall in the sense that you might not learn the fundamentals, because it is so easy to use. So I’ll come back to that.

Computers: I've said this before, it's all Python, freely available. We're requiring, or at least strongly recommending, that you bring your own laptop. It's not strictly required, but I would love it if you all had administrator rights on your computers, meaning not using a university-managed computer. If you're using a personal computer, that should not be an issue at all. Reach out to me ASAP if you're using a university-owned computer. Also reach out if you don't think you'll be able to bring a suitable computer, or you have an ancient one, or you start to realize that you can't even pull the class repository with Git because you don't have enough storage. That would be something we'd have to work out together very quickly. I have a few loaner PCs, though I don't like to use them.

It's possible to do this on a PC, Mac, or Linux machine. All the examples will be shown on a PC, though. It's not that hard to translate over.

And that’s it for the syllabus. Any questions about the syllabus? There’s a lot more out there, boilerplate on academic freedom and scholastic honesty. I just kind of copy and paste. But any questions?

Good, because I want to dive in. There is a lot to do.

So let’s start at the basics. What is big data?

Not surprisingly, big data means different things to different groups of people, but if there were a standard definition, it's the idea that data sets are so large or complex that traditional data processing applications are inadequate.

This is increasingly important as we get streams of data, for instance the stream of video collected by a self-driving car. That's data, but it's a lot bigger, because each frame is over a million pixels and you have 60 frames per second. That's a whole lot of observations, far more than 2,000 households in a Tanzanian demographic household survey or something like that. You immediately need to be able to think at larger data scales.
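Just to put numbers on that, here is a back-of-the-envelope comparison in Python, assuming a hypothetical uncompressed 1080p color stream at 60 frames per second (real pipelines compress heavily, so treat this as an upper bound):

```python
# Back-of-the-envelope: one hour of raw 1080p color video vs. a small survey.
# Assumed numbers: 1920 x 1080 pixels, 3 bytes per pixel, 60 frames per second.
pixels_per_frame = 1920 * 1080            # about 2.1 million pixels
bytes_per_frame = pixels_per_frame * 3    # 8-bit red, green, blue
frames_per_hour = 60 * 60 * 60            # 60 fps times 3,600 seconds

gb_per_hour = bytes_per_frame * frames_per_hour / 1e9
print(f"raw video: roughly {gb_per_hour:,.0f} GB per hour")

# Compare with a survey of 2,000 households and, say, 500 numeric variables.
survey_mb = 2_000 * 500 * 8 / 1e6         # 8 bytes per 64-bit value
print(f"household survey: roughly {survey_mb:.0f} MB")
```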

Consumer data is huge. This is actually one of the older forms of big data: linking all of your purchases to your credit cards. You're probably not surprised, but Target knows way more about you than you would expect. That's why they're willing to give you a 5% discount, or whatever the boost is, if you use Target Circle: it essentially cleans and joins their data much more effectively, so they can use it better and sell it to other people. They're not being nice here; this is a profit-maximizing decision.

We’ll talk about this one, remotely sensed data, a lot. Satellites or drones are now constantly taking pictures of Earth, and this generates very interesting data, but they’re also very big.

Maybe the last type of big data is just traditional data, but really big. If you download the entire FAOSTAT database, for instance, that's 600 megabytes. That's big enough that you would not be well advised to simply load the whole thing at once. When you can't just load the whole thing, that's what we're going to call big data.
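As a minimal sketch of what "don't load the whole thing" looks like in practice, here is how you might stream a large CSV through pandas in chunks; the file name and column names are hypothetical stand-ins, not the actual FAOSTAT layout:

```python
import pandas as pd

# Stream a large CSV in 1-million-row chunks instead of loading it all at once.
# "faostat_bulk.csv", "element", and "value" are hypothetical names.
totals = {}
for chunk in pd.read_csv("faostat_bulk.csv", chunksize=1_000_000):
    # Aggregate within the chunk, then fold into the running totals.
    grouped = chunk.groupby("element")["value"].sum()
    for key, val in grouped.items():
        totals[key] = totals.get(key, 0.0) + val

print(pd.Series(totals).sort_values(ascending=False).head())
```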

I got into a debate also. Some people argue that the technical definition of big data is that it’s high-dimensional, and so this means you can’t express it just in two-dimensional spreadsheets or even three-dimensional panel data, but that it’s the dimensionality of data that defines it as big data. That’s not how I’m going to regard it. I think that’s a different challenge, but it’s not related to what we’re talking about.

There are many related subfields, and how they use the data also shapes what counts as big data: machine learning, which is core here, and artificial intelligence. The idea of what big data is also depends on technological advances in computer science and hardware. Stuff that was big data a decade ago isn't big data anymore, because you can load it all into memory on your computer and it's not slow.

The final subfield that we will constantly reference is econometrics, exactly as we've done before, just with bigger tables. That's an important one to understand as we leap from things that fit in your computer's memory all at once to things that can't.

And that leads me to not just the general topics, but why should economists care about big data?

The easiest one is, number one, what happens if you have too many observations? That can lead to thousands of years of runtime, which is not a great PhD strategy or a good use of your computing resources. If you have tons and tons of observations, my rule of thumb is that if you're ever waiting more than about a minute, unless you're 100% confident this is your last run, you should not be doing it that way. You need to figure out why it's going slow and speed it up. We'll get into that.

Another reason you might need to think about big data as an economist: have you ever inverted a matrix, the X-prime-X inverse in OLS? Well, what happens when n is super huge, say 1 billion? Can you fit that in your computer's memory? Absolutely not. There are, however, all sorts of tricks, essentially working on subsets or chunks of the data, or using other decomposition approaches, which we'll talk about, that can deal with that and still give you the basic benefits of ordinary least squares.
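Here is a minimal sketch of one such trick, using simulated data: because X'X is only k-by-k, you can accumulate it and X'y one chunk of rows at a time, so the full X never has to sit in memory, and you still recover the OLS estimate.

```python
import numpy as np

k = 5
XtX = np.zeros((k, k))   # k x k: tiny regardless of how large n gets
Xty = np.zeros(k)

rng = np.random.default_rng(0)
true_beta = np.arange(1, k + 1, dtype=float)

# Pretend each loop iteration is a chunk streamed from disk; the total row
# count could be a billion without changing the memory footprint.
for _ in range(100):
    X = rng.normal(size=(10_000, k))              # one chunk of rows
    y = X @ true_beta + rng.normal(size=10_000)
    XtX += X.T @ X                                # accumulate X'X
    Xty += X.T @ y                                # accumulate X'y

beta_hat = np.linalg.solve(XtX, Xty)              # same estimate as full-data OLS
print(beta_hat)                                   # close to [1, 2, 3, 4, 5]
```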

Another reason you should care about big data is that it is what enables many of the new approaches, functional forms, and improved prediction ability that we see coming out of machine learning and AI in general.

But I do want to talk about the risks related to that. That's one of the topics I'm going to emphasize more this year than in previous years: some of the downsides of big data and machine learning, not just the academic problems, like spurious correlation or false claims, but also the ethical ones. Where are real problems going to arise in society? I'm not going to focus on questions like whether AI will get rid of all of our jobs; that's a legitimate question, but I'm instead going to ask whether AI itself might cause us to think about things in a biased or negative way. We'll talk more about that.

But finally, a lot of economic research is just going in this direction. It's hard to come up with a new, good, publishable topic on an existing database, because there are ever more PhDs and researchers out there almost comprehensively looking at all the different ways you can analyze household survey data. If you collect your own data, that's an obvious way to publish well, but taking existing household survey data is not sufficient. Well, I shouldn't say not sufficient, but it's just a lot harder to find a really novel question if you didn't collect the survey data yourself. In general, though, there is a ton of new research along the lines of: let's take household survey data and connect it with new types of data, data that we might make ourselves.

For instance, I know you did web scraping in 8221, which is a good example of it, but more conceptually: what is the cross-sectional X-sub-ik of a tweet? That's something we'll talk about. You can turn tweets into data, and that would be an interesting way to answer novel questions; you might want to pair it with existing data sets or approaches. But anyway, this is just where the field is going, so being at the forefront is the main reason, I think, to care about this.

Okay, so let me talk through a few of those examples, just to make it fresh. I put a subtitle in here: historical examples kept in for hilariousness. This field is changing so fast that even the relevant memes I made two years ago are no longer funny. But I'm going to keep them anyway, partially because it takes a long time to come up with new ones, and partially because they illustrate the trajectory of this course.

One example of new data would be voice analysis. How could you write algorithms that go from the raw input, which is essentially a time series of amplitudes at different frequencies, to something useful? That raw signal doesn't have much meaning on its own, but we can see pretty clearly how we get from it to text, and increasingly, with natural language processing, from text to meaning.

Image analysis is going to be another one we'll spend a long time with, and this is actually where machine learning and AI really got their start: taking data, literally these famous data of handwritten digits, which have been rasterized and expressed as matrices. Can we analyze images like this to get meaning, such as which digit it is? Classification, then, is a very important task.
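Those famous handwritten digits ship with scikit-learn, so you can see the image-as-matrix idea for yourself; a minimal look:

```python
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)   # (1797, 8, 8): 1,797 digits, each an 8x8 grid
print(digits.images[0])      # one digit as a matrix of pixel intensities
print(digits.target[0])      # the label: which number it is
```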

Really, categorization more generally can even refer to dogs and cats. That’s another one of the classic machine learning things to do, and we’ll play around with that. Who’s heard of generative AI, though?

That's kind of a buzzword. What is generative AI? What if, instead of just categorizing things, instead of categorizing cats and dogs, we created cats or dogs? That's what generative AI is: the idea that we can not just do classification but also use those same structures to come up with new, generated versions of things. This example is one that is now out of date, but two years ago, and the website is still up, TheseCatsDoNotExist.com implemented a generative adversarial network to generate cats.

Actually, if you were deeply involved in the Twittersphere at the time, this was really a joke. There was a more famous website called This Person Does Not Exist, which showed scarily accurate images of people created by a generative AI, a generative adversarial network specifically, that looked really real. They spent a lot of time on that one. This website was making a joke about that website, and they spent less than a day building it, so it also illustrates some of the downsides of doing machine learning incorrectly. Like, what do we see?

Scary, terrifying abominations of cats. If you actually go to the website, some of them look a lot better. I just collected a few that are particularly nightmare-inducing, or even—they didn’t clean their input data set very well, so what do you think this is? These aren’t words, but why do you think it’s formatted that way?

That’s a meme! It’s approximating a meme, because the datasets about cats tend to include lots of meme text, and so they did not properly clean that, so I just think it’s kind of funny, and I have no idea what this one is. But yeah, you can check out that.

I've mentioned this already, but sentiment analysis. There are a lot of interesting applications, even stock market investing strategies based on which companies famous people mentioned in a tweet and whether the mention was positive or negative. You might expect that if President Trump were to speak negatively about a specific company, its stock might go down. If you can analyze the Trump Twitter or Truth Social stream in real time, you might be able to make an investment choice right after the statement is made, but before the value change has happened. So that's an interesting question.

But for me, really, this is the one that matters the most—the terabytes and terabytes, petabytes of information that get generated per day by these satellites zooming around our Earth, taking pictures of what’s going on. We’ll talk about some of the really important articles, like, can this be used to assess poverty? Actually, did Allie talk about that concept?

Okay, good, because I want to talk about it. Basically, the question is: can we infer from remote sensing satellites the presence of different types of roofing material? Is it metal? Is it something else? Or the density of the road network? From these things, can we assess economic factors like poverty? This gives us information that you simply couldn't collect from a census, because that's always reported at some administrative scale, whereas we can get it down to specific pixels, specific houses and businesses, and that's fun.

But environmental factors are also coming frequently from remote sensing sources like this.

We’ll talk about two different types of data pretty systematically throughout this. I think you did see these in 8221. Number one, what I’ll call raster data, and that’s just going to be a two-dimensional matrix of values. Did Allie talk about that, I assume? Excellent. So you’ve got some background, you probably used the raster package. Is that what she used this year? Or did you switch to Terra, or something like that?

Okay, that’s cool. We’ll talk about that, but in the Python ecosystem, I might follow up on that, because I’m actually curious.

But we'll also think about how to connect raster data with vector data. Vector data is what would, for instance, take something like household survey data and link it to spatially explicit definitions. You might say a household has a specific latitude and longitude. They always fudge it by quite a bit for privacy reasons, but that's essentially a traditional cross-sectional survey input, nominally linked via vector data to a specific point, or maybe a polygon.

Once you get it in vector form, you can easily do additional analyses that we'll talk about, like connecting it with raster values to see, for example, how something like land cover changes poverty outcomes.
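Here is a minimal sketch of that raster-to-vector join in the Python ecosystem, assuming a hypothetical point file of (fudged) household locations and a land cover GeoTIFF; the file and column names are placeholders:

```python
import geopandas as gpd
import rasterio

# Hypothetical inputs: surveyed household points and a land cover raster.
households = gpd.read_file("households.gpkg")

with rasterio.open("lulc.tif") as lulc:
    # Make sure the points use the same coordinate system as the raster.
    households = households.to_crs(lulc.crs)
    coords = zip(households.geometry.x, households.geometry.y)
    # Sample the raster at each point: the land cover code at that household.
    households["lulc_code"] = [vals[0] for vals in lulc.sample(coords)]

# Now each survey row carries a spatially derived covariate.
print(households[["lulc_code"]].head())
```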

I'll very frequently talk about maps like this, which show land use, land cover, and I'll just frequently say LULC maps. This is the National Land Cover Database for the United States. It's a very common input to lots of the environmental models I talked about last class, but it's also super relevant even for people who don't care about the environment, or who actively dislike the environment. Sorry, I'm going to make that joke again; there's a difference there. You might be able to extract information from data like this that's relevant to all sorts of other questions, like health or labor market participation. That's because you can zoom.

Just to illustrate the high-resolution nature of this: these are 30-meter grid cells. Let's keep zooming. Now we're starting to see categorized data, where the different colors refer to different land covers.

You might ask, why do I care so much about that? I already mentioned it’s the primary input to a lot of environmental economic models, but more generally, I would say it’s because it says a lot about individual decision making, and economists can work on that. So, for instance, this farmer here, how do I know that’s a farm? Well, that’s what farms look like, at this resolution. Trust me, you start to—I won’t get into that. But the point is, this is a farm.

For this person living right there, we could use environmental models to figure it out, but I'll just jump to the conclusion: this person has a really big impact on the water quality of this watershed right here, much more than somebody over here who is farther away. Moreover, the choice of whether to plant buffer vegetation in this location will have a very large impact on how much this farmer influences the water quality in this watershed. That's all spatial information, and it gets at really interesting questions.

This is big data, so let's keep zooming even farther. The point is that all of this, under the hood, is just a big old matrix. You could make a land use, land cover map in an Excel sheet by making the cells square and coloring them; that's how I made this example here. But the point is, we'll talk about big data formats, generally GeoTIFFs, that can store this information at global scales with high resolution.

Why not store raster data in a CSV file? That's a really good, deep computer science question that I could talk about for an hour, but let me give you the 30-second answer. The information could, in principle, be stored in the same amount of space either way. The difference is that a two-dimensional array keeps extra information, such as which grid cells are neighbors. If you flattened the grid into a one-dimensional CSV, two pixels that are vertically adjacent in the grid would no longer be anywhere near each other in the flattened data.

There's a lot more I could say about that. But a lot of optimization, when you get down to the bare-metal speed of your machine, comes down to how and where in memory the data are stored; if related values are nearby, operations on them tend to be faster. That's not going to be on the test. Which is funny, because there's not a test.
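To make that memory-locality point concrete, here is a small NumPy experiment: copying the same amount of data in storage order versus gathering it across strided columns. Exact timings depend on your machine, but the contiguous pass is typically several times faster.

```python
import time
import numpy as np

a = np.random.rand(6_000, 6_000)   # C-ordered: each row sits contiguously in memory

start = time.perf_counter()
b = a.copy()                       # streams through memory in storage order
t_contiguous = time.perf_counter() - start

start = time.perf_counter()
c = a.T.copy()                     # same data volume, but gathers strided columns
t_strided = time.perf_counter() - start

print(f"contiguous copy: {t_contiguous:.3f}s, transposed (strided) copy: {t_strided:.3f}s")
```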

So, coming back up a scale towards the more classic questions of statistics, one example that we’ll start with pretty early on is, what if you just want to run a regression, but instead of on that CSV of one-dimensional cross-sectional data, what if you want to run it on a stack of rasters? That’s a sort of straightforward question, I think. Like, what if you wanted to say, could we predict land use land cover from similarly defined rasters of things like soil depth, or plant-available water content, or topography? We’ll talk about the specific methods to do raster stack regression. It’s really just regular regression, but organizing the data in a way that you can do it in gridded format. So we’ll talk about that.
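Here is a minimal sketch of that reorganization, assuming hypothetical predictor rasters and an LULC raster that share the same grid; since LULC codes are categories, I use a classifier rather than OLS, but the data wrangling is the point:

```python
import numpy as np
import rasterio
from sklearn.ensemble import RandomForestClassifier

def read_band(path):
    with rasterio.open(path) as src:
        return src.read(1)            # first band as a 2-D (rows, cols) array

# Hypothetical GeoTIFFs that all share the same grid.
predictors = [read_band(f) for f in ["soil_depth.tif", "plant_water.tif", "elevation.tif"]]
lulc = read_band("lulc.tif")

# Stack to (rows, cols, n_features), then flatten to (n_pixels, n_features).
X = np.stack(predictors, axis=-1).reshape(-1, len(predictors))
y = lulc.reshape(-1)

# Drop nodata pixels, assuming a nodata code of -9999.
valid = (y != -9999) & np.all(X != -9999, axis=1)
model = RandomForestClassifier(n_estimators=50).fit(X[valid], y[valid])

# Predict every pixel and fold the result back into map shape.
predicted_map = model.predict(X).reshape(lulc.shape)
```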

Okay, so that’s my sort of reason why I think it’s important to think about big data.

But let me transition to a different part of this first lecture, with a funny historical note on why this course is called Big Data. Six years ago, when I was applying for this position, big data was a really attractive-sounding phrase. Economists thought it was the new, latest, greatest thing. I was asked in my interview whether I could teach that course, and I said yes.

And so I started teaching that course, and it had to have exactly that name. However, I think that term is becoming less and less important. I actually applied to rename this course, and then, being very unorganized, I simply didn't email it in time, but I would have liked to name it Big Data, Machine Learning, and Artificial Intelligence for Economists, which connotes a pretty different set of content than just big data. But that's where we're at.

Who here would not have taken this course if it was named Big Data, Machine Learning, and Artificial Intelligence for Economists, instead of just Big Data Methods for Economists? Anybody?

Really? I might have been more intimidated. More intimidated? Okay, but you’re not, like, you wouldn’t be bothered that it wasn’t just all big data? Okay.

Okay, then, because I'm not seeing any real objections here other than that one, and it's a fair one, but it's not like you hate AI or something. Well, okay, there's a little roll of the eyes. I also kind of hate AI, too; let's not get into that yet. All I want to say is, essentially, I'm going to treat this course as if it were named Big Data, Machine Learning, and Artificial Intelligence for Economists, and maybe I would even have named it just Machine Learning, Artificial Intelligence, and Programming Methods for Economists.

If you have a problem with that, you can either speak up now or email me in private, but I basically don’t want to be constrained by just talking about big data, okay?

This also indicates the balance: we will still cover big data, yes, but with much more emphasis on how it's used in machine learning and AI. And that last option is that I almost got to name the whole course simply Advanced Coding Methods for Economists. You've mastered R; how can we go a little more general than that? Let's take on a general-purpose programming language, like Python, rather than a statistics-focused one.

So email me if you have problems about that, but I hope you don’t.

[Question about image data and channels]

A great question. It's true that remote sensing data in its rawest form comes as different channels, so red, green, blue would be one good example. Actually, it's multispectral, or even hyperspectral: in reality there can be 15 or so channels, depending on which satellite you're talking about, covering red, green, blue, ultraviolet, infrared, and a whole bunch of others. You can measure the spectrum in many different ways.

But the two-dimensionalization is actually a processing step where you go from the three-dimensional data, so X and Y plus a value for each channel, say red, green, and blue, and categorize that raw information down into, as in my example, land use, land cover. You're right that land use, land cover loses information compared to the input red, green, blue data, but it also makes it a whole lot more useful, because now it has an interpretation, like "it's urban," instead of "it's this amount of red and this amount of green." So, great question.
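Here is a minimal sketch of that categorization step, with simulated bands and labels standing in for real imagery: each pixel's channel values become one row of a feature matrix, a classifier maps rows to land cover codes, and the predictions fold back into a two-dimensional map.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_bands, height, width = 4, 100, 100            # e.g., red, green, blue, near-infrared

bands = rng.random((n_bands, height, width))    # simulated imagery (bands, rows, cols)
pixels = bands.reshape(n_bands, -1).T           # (n_pixels, n_bands): one row per pixel

# Pretend we hand-labeled 500 pixels with land cover codes (0=water, 1=urban, 2=crop).
labeled_idx = rng.choice(pixels.shape[0], size=500, replace=False)
labels = rng.integers(0, 3, size=500)           # placeholder labels for illustration

clf = RandomForestClassifier(n_estimators=50).fit(pixels[labeled_idx], labels)
lulc_map = clf.predict(pixels).reshape(height, width)   # back to a 2-D categorical map
print(lulc_map.shape, np.unique(lulc_map))
```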

Okay, so now that we’re all comfortable with me talking about machine learning a whole bunch in this course, and I’m glad nobody objected, because I would have no slides now to talk about, really, let’s talk about what is the difference between machine learning and econometrics?

These are really two different disciplines; they evolved separately in a lot of ways and almost branched off into different cultural groups that developed their own languages. It's as if a cataclysmic earthquake hit academia and separated it into two islands that then co-evolved separately. We will actually be learning both vocabularies, and I do think it's useful to do a crosswalk between the two.

So first off, machine learning, what is it? It’s a subset of artificial intelligence that focuses on developing algorithms that can learn from and make predictions on data, and the key emphasis there is prediction. That’s where it’s really different from econometrics. Econometrics, conversely, I would define as a subfield of economics that applies statistical methods to test hypotheses, estimate relationships, and forecast economic phenomena, but really, it’s causal inference. It’s answering questions of not just can we make a prediction, but can we understand why?

That's useful if you're doing policy, because you don't want to just predict the economy without understanding why, since then you can't make better policies.

So there is much overlap and complementarity between them, but unfortunately the languages of these two cultural groups have diverged quite a bit, so we'll have to be careful with our terminology.

To do that, I want to show off one of the weirdest ways the languages have diverged, with one of the core figures that you learn about in machine learning. It’s this one.

So big data, as I've already mentioned, enables big model complexity. If you have a teeny bit of data, you might have to worry about degrees of freedom. Who learned about degrees of freedom in their statistics class?

Yeah. If you're dealing with that, you're dealing with small data. It means you're running out of observations relative to the number of covariates you have, essentially, right? Well, in the big data world, that's not going to be an issue, because you have millions or billions of observations, and you probably don't want to have billions and billions of covariates (although we'll get to cases where you do).

The point is, with all that data, you can then make your model very, very complex. And what's the risk of that? Well, let's think back to ordinary least squares. When you add another coefficient to your regression, what do we know for sure will happen to the R-squared?

It’ll go up. What about the adjusted R-squared?

It depends. That's why they introduced adjusted R-squared. It's a really simplistic way of accounting for the fact that just adding more covariates will always improve your in-sample fit.

Taken to the extreme, if you had lots and lots of covariates, you could perfectly fit what's going on in your data. But there's a risk to that.

Machine learning folks spend a lot more time thinking about this and have developed much more sophisticated metrics for it than just adjusting the R-squared. They do a very good job of thinking about how the prediction error changes as you go from low complexity to high complexity, in two different cases.

The first one is the blue line. This is similar to OLS: what happens when you make your model more complex, which essentially means adding extra covariates? Your prediction error on the data you fit keeps going down (your R-squared keeps going up). That's not surprising.

What machine learning focuses intensely on is the fact that, yes, within the data you used to train your model, the fit gets better and better, but when you evaluate on unseen data, what we'll call the testing data instead of the training data, the curve very frequently has this exact shape: at the beginning, adding model complexity does the same thing for both the training and the test data, reducing your prediction error, but at some point the test error starts to curve back up. That's where you have essentially memorized the training sample rather than figured out general relationships.

When you apply that model to unseen data, its performance, even with many, many covariates, is worse than if you had used fewer covariates. This is just fundamentally true, and there's much more emphasis on it in machine learning than in econometrics.

But I do want to note the difference in language, and I got tripped up on this the first few times I lectured this class: machine learners use this word variance. That sure sounds like the variance we use in econometrics, the variance of the data. No, that's not what they mean here. Variance here is, roughly, how sensitive the fitted model is to the particular training sample; in practice it shows up as how much prediction quality differs between the training sample and the testing sample. So this red line going back up is the high-variance end. We're not saying anything about the distribution of the data; we're saying the model's performance varies a lot away from the training data.

Isn't that a weird difference? Another one you might notice: bias. We have bias in econometrics too, right? Well, they use that word differently. Bias here is systematic prediction error from a model that is too simple. We have high bias on the left, where you don't make a good prediction even on your training sample, and low bias on the right, where you fit your training sample very well.

We won't get into the finer details of these linguistic debates, but it is worth noting that these different languages are in play here. Look, I literally had that as the next slide. There's a complementarity here: you could talk about it this way, in terms of the complexity-variance trade-off, which is what this graph shows.

The next one is the complexity versus fit-overfitting trade-off. It's the same fundamental phenomenon. As you go from low-complexity models, like a single covariate in an OLS, your accuracy is low, and it's low whether you evaluate on the training data (the blue) or the testing data (the green). But as you add more and more complexity, you hit a point, the sweet spot, where accuracy, as measured on the unseen testing data, hits its maximum.

Accuracy on the training data will keep climbing toward perfect as the model gets ever more complex, but the sweet spot is here, because it's not a big flex to say you can fit the data you trained on really well. That's something you can automate. What is a big flex is being able to predict things you didn't already see.

That means having the right level of model complexity: the sweet spot where the model does well on the training data, so there's little bias, but also does well on the unseen data, so there's low variance, to use the machine learning lingo.

Does that get to what you were asking?

You'll hear a lot of these words thrown around, like the problem of overfitting. You can just think to yourself that it's related to this bias-variance trade-off, though overfitting is probably the more frequently used term. If you overfit your data, you will get a really high R-squared, but the model won't be able to predict anything at all on data that wasn't used to train it.
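Here is a minimal sketch of that curve on simulated data: as polynomial degree (model complexity) rises, training error keeps falling, while error on the held-out test set eventually turns back up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)   # true signal plus noise

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in [1, 3, 5, 10, 15]:
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    train_err = mean_squared_error(y_train, model.predict(poly.transform(x_train)))
    test_err = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```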

We'll dive into this. This figure looks complex, but I just want to indicate where we're going. Machine learning, like I said, really does focus on this question of splitting the data into training and testing sets, so that we don't just fit the training data well but also have good out-of-sample predictiveness.

They go a whole lot farther than just splitting it into training data and testing data, and really what we’ll use constantly throughout this course is called cross-validation.

Cross-validation is where you split your training and testing data in multiple ways. You first do one split so that you have some testing data that is never seen by any part of the algorithm. You have to do this before anything else is done with the data, otherwise you might have leakage. We set this testing data aside, and it can't be used in any of the model calibration steps.

That leaves the rest as training data, where we want to come up with the best possible model. Essentially, once the testing data is set aside, we're going to p-hack the training data really hard, to use the language of econometrics. P-hacking sounds like a bad thing, and it is if you only ever look at the training data. But if you have withheld the testing data, then p-hacking is not necessarily a bad thing.

Really, what we're going to do is break the training data into lots of different splits, or folds. We'll train a model on most of the folds and assess it on the withheld fold, then do it again with a different fold withheld, again and again. We'll talk through the details when we get to specific algorithms, but we're going to do this systematically to tune the model, and on the training side we're going to overfit like crazy, p-hack it perfectly, and come up with literally the best fit we can.

But we're not going to validate the model, or establish that it's good, based on that fit. Instead, we're going to ask how well it does when compared against the unseen test data.
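A minimal sketch of that workflow with scikit-learn on simulated data: hold out a test set first, tune as hard as you like with k-fold cross-validation inside the training data, and only then report performance on the untouched test set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=2_000, n_features=50, noise=10.0, random_state=0)

# 1) Set aside test data before anything else touches it (avoids leakage).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2) Tune as aggressively as we like, but only inside the training folds.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 13)},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

# 3) The number that counts: performance on the never-seen test set.
print("best alpha:", search.best_params_["alpha"])
print("held-out R^2:", search.best_estimator_.score(X_test, y_test))
```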

That's the fundamental difference, really. If you were to ask me for one word that describes the difference between machine learning and econometrics, it's cross-validation. Some econometricians do use cross-validation, but it's really uncommon.

It's just a different focus: there's less data, and the interest is in getting a really good fit on the data at hand, because causal inference and things like that are done that way. They're not about prediction. So, yeah.

[Question about standardized approaches]

Is there a standardized—yes, there is. We’ll talk about that, and the real answer is a lot of that’s automated away by some of the software packages we use, but yeah.

Any other questions? I’ve made a lot of strong claims that other people in this department would maybe disagree with, even. Anybody feel uncomfortable about this idea of p-hacking is okay? No, you’re all okay with that?

Eva, you look a little—we’ll see, okay. I guess I did claim a solution to it, so it’s a little unfair to—yeah.

[Question about double descent]

If you are very careful to never let that testing data leak into the training part, no, I don't think that's a concern. But we'll actually return to this, and I hesitated for a second before answering, because this graph had been true of machine learning for most of its history, but in late 2022 that changed.

There's something we'll talk about later called double descent, yes, so we will get to that. First we're going to learn it the old-school way, where the picture looks like this and you can't overfit so long as your testing data is always withheld from training. But we'll then move into the world of things like ChatGPT, which in many cases has more parameters than data. That's not quite the right way of putting it, but we'll return to it as my direct answer to your question.

Okay. Oh yeah, I already got excited and said most of what's on this slide, but yes: normally in econometrics, we cheer when our p-values are tiny, right? The worst thing that can happen to you is a 0.11 p-value or something like that, where you can't quite publish at the 90% confidence level.

We would cheer when our p-values are tiny, but in big data, our p-values are almost always tiny. And I mean really tiny, beyond the precision of the 64-bit floating-point operations our computers use to assess how low the p-value is.

I see papers that report a p-value of 0.0000001, and that's not a good thing. It usually just means they should have cross-validated. There are very few phenomena you can assess econometrically with a p-value that close to zero without probably also committing the overfitting problem.

So is it a good thing that our p-values are now always tiny? We'll introduce new metrics to make up for it, and they basically come down to out-of-sample prediction quality. The idea of cross-validating has been around forever, but it's really with the advent of big data that it becomes practical, because it's very data-hungry. That's the downside of cross-validation: you need a whole lot of data for it to work in the first place. If you have a small data set of 200 household survey observations, you can't do cross-validation; you'll run out of observations really quickly. So there's that.

I'm going to skip this, because, like I said, we're going to have a bigger section of this course, probably a full section, on the criticisms of big data. But just to indicate the direction we'll go: one of the problems with using big data to make predictions is that in this field we're focused on, even obsessed with, the data, which by definition happened in the past, or at best in the very moment we're in.

But if the past world had something like racism or sexism in it, our model will be trained on that. It then becomes very difficult to extract why an algorithm is making a decision, and it might just be embedding that racism or sexism in a very hard-to-detect way. That leads to issues such as algorithmic discrimination.

Okay. So now let me spend the last six minutes of class here. That was ML, machine learning. You might be asking, what's the difference between ML and AI? More people use the word AI now; when I last taught this in 2023, more people used the word ML.

Actually, literally while I was teaching the lecture on neural nets, which we'll have in this course, ChatGPT 3.5 was released, and that was the moment when large language models and AI in general took over the public consciousness. So it was absolutely weird teaching this class when, suddenly, it could generate my slides for me, while trying to talk about the tools that do this.

Where does this fit within the rest of the content? Technically speaking, machine learning is a subset of AI; that's the official definition you're supposed to give, because there are other AI approaches that don't conform to the machine learning definition. The definition of machine learning is: use algorithms that can train themselves on lots of data to make a prediction. But there are other types of AI, and in fact most of the history of AI was not machine learning; it was rule-based systems and systems of symbolic logic, where you tried to build AI from fundamental axioms of mathematics, or just expert systems.

Lately, though, because of large language models, AI is so closely associated with generative AI and LLMs, which are themselves a subset of machine learning, that it almost feels like AI is a subset of machine learning. Really, generative AI is pretty much a subset of machine learning, sitting inside that broader AI.

Just to give you a flavor: among the tech crowd who were good at memes at the time, the standard take on why you should not care about machine learning, right before ChatGPT came out, was the idea that all machine learning is just a bunch of ifs and elses. That describes the old type of AI, the rule-based or symbolic approaches. It is no longer true.

Okay. So, because there's a lot of duplication in this, I'm just going to walk through it quickly; don't look at the details, and we only have three more minutes anyway. But in that first assignment that I'll email out after class, I'm going to want you to go through this, which is essentially the material I taught previously.

But then also, our course webpage directly. Let me stop sharing—share my screen again. Okay, I think I got it.

Yeah, so this will be updated, like I said, but then there's this massive section. I spent the last three weeks on it, because my favorite part of teaching a course like this is that I have to stay at the very tippity-top of the field; I would feel embarrassed not to present something from 2025, or close to it. This section lists a whole bunch of other topics that I think could be inserted into the syllabus, and I would like you to assess them and see which ones you are interested in.

I've already heard a few people talk about this one: the convergence of machine learning and causal inference, which goes under the Chernozhukov double/debiased machine learning approach. That would be one.

Modern neural networks and transformers, the technology behind LLMs. Gradient boosting, really powerful. Computer vision. Here's that ethics and fairness topic; I just have a few readings, and there are obviously many more. There are also some articles about practical tools for doing this.

Time series: a lot of finance folks really like stock market prediction, and ML is really good at this. Interpretable machine learning: trying to move away from a black-box approach, which gives a good prediction but no explanation of why, toward a model that can actually be understood.

And then finally, natural language processing for economics. NLP, which enabled large language models like ChatGPT, is also super useful for generating data that feed into traditional econometric approaches.

So, yes. In the last one minute, any questions on today’s content and what I’m asking you to do for next class? I know it’s a quick turnaround, but this is a relatively quick assignment. Any questions?

The final assignment that I'll email out in a moment will say this, but basically you're just going to submit to Canvas a paragraph writing up which topics you would like to see added, in addition to your GitHub username.

At least a paragraph, but you can write as much as you want. As many as you want.

Well, you have to—okay, here’s the model complexity, quality of result trade-off. If everybody said all of them, it would be as if I didn’t ask anything. So you can optimize according to your own parameters if you’d like.

Fine, four. Maximum 4, minimum 1, maximum 13. Yeah.

No, this is just going to be a discussion; we'll return to it, and feel free to give me detailed thoughts on anything you're interested in. Oh, and by the way, I should also say that all of the items listed here, the 40 or so papers, were selected very specifically to have both a seminal paper that's linked and a Python implementation of it. There are plenty of good papers out there that, even if they were done in Python, don't have a good, replicable Python codebase that combines the data with a replication package. These ones all do.

That's why I'm focusing on these ones. There may be others, but for me to include them, I'd want them to have both a good paper and code that's not just available, but available in a way that somebody could replicate the results in less than five minutes.

Okay. I think with that, let’s call it. Feel free to email me or just come up and ask questions now directly, and then I’ll email out the assignment as it goes live.

Thank you all!