We held the roundtable at a meetup at WeWork in SOMA in San Francisco. It was moderated by Chartio's VP of Growth, Jordan Hwang. You can also watch the video of the roundtable if you prefer.

Panelists: Tina Adams, AJ Welch, Nathan Leiby, and Marc Bollinger

What's your approach to schema design?

Tina: I usually tell customers to start with whatever schema they're already using. Many of our customers do different things, very performantly, so we see star schemas. A star schema is great in some ways, but the downside is that you have to do a lot of joins. Redshift can do those very well if you tune it properly.

We have something called a distribution key. For example, I can distribute my data on customer ID, and when I do a join on customer ID, all the data is co-located on the same node, so you don't have to pull data from another node to do your join. But that might not always work for you, because you join on different things, so Redshift's default is to try to balance out the cluster and distribute evenly: if you don't specify a distribution key, we'll do an even distribution across all the nodes, which can work very well too. And if you have a star schema, we've written a detailed blog post about running star schemas on Redshift.

Distribution keys and sort keys are also critical. I see customers who don't have sort keys, so they have performance issues. The problem is that we don't have a traditional index in Redshift at all. We use zone maps, based on your sort key: we keep metadata for every single block on disk that tells us what the min and max values are for that block. So, when you filter on the sort key, Redshift doesn't have to scan the whole table to find your rows. But if you don't have a sort key, and you haven't done things like vacuum your tables when you've added new data, a filter may all of a sudden take forever, because you're scanning billions of rows to find stuff from the last day.

Redshift can deal with a lot of different schemas, as long as you set your sort key and your distribution key properly. We also have an interesting new feature called Interleaved Sort Keys. With interleaved, you can choose any of those keys, or only one of them, and scan across your data performantly. (Chartio's AJ Welch is working on a blog post about Interleaved Sort Keys. Subscribe to our blog to be notified when it's online. Minimal SQL sketches of the distribution-key, sort-key, and interleaved options also follow at the end of this post.)

Nathan: As I mentioned, a lot of our data comes over in a pretty familiar schema we designed within Mongo to describe educational data. So we try to use things like sort keys, but we don't really change how the actual data is structured in Redshift. We do have to deal with some issues, such as flattening fields that came from Mongo, but it hasn't really been an issue for us, because our data is mostly an array of many strings, and you can just search on it.

Marc: I don't really have too much interesting to say here. We're mostly relational, and then we have lots of roll-ups, usually by user, by platform, and a few other columns. A lot of our information is dates and timestamps, which we use for sort keys, and our distribution keys use user IDs. We're inserting in timestamp order, so it's pretty obvious to sort on the timestamp. But I imagine we could do more post-processing if we needed to make that data more queryable.
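To make Tina's distribution-key and sort-key points concrete, here is a minimal sketch in Amazon Redshift DDL. The table and column names (events, user_id, platform, created_at) are invented for illustration and are not from the roundtable; DISTKEY, SORTKEY, DATEADD, GETDATE, and VACUUM are standard Redshift syntax.

```sql
-- Hypothetical example (names invented): distribute on user_id so joins
-- on user_id stay node-local, and sort on created_at so zone maps work.
CREATE TABLE events (
    user_id    BIGINT    NOT NULL,
    platform   VARCHAR(32),
    created_at TIMESTAMP NOT NULL
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (created_at);
-- Omitting the distribution key (DISTSTYLE EVEN) spreads rows evenly
-- across nodes, the default behavior Tina describes.

-- Filtering on the sort key lets Redshift check per-block min/max
-- metadata (zone maps) and skip blocks instead of scanning the table.
SELECT COUNT(*)
FROM events
WHERE created_at >= DATEADD(day, -1, GETDATE());

-- After loading new data, re-sort the table so zone maps stay effective.
VACUUM events;
```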
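The interleaved variant Tina mentions could look like the sketch below, again with invented names. An INTERLEAVED SORTKEY weights each listed column equally, so a filter on any one of them, or any subset, can still use zone maps; one operational difference is that interleaved tables are re-sorted with VACUUM REINDEX.

```sql
-- Hypothetical example: equal-weight sort across three columns, so a
-- filter on user_id alone, platform alone, or created_at alone can all
-- take advantage of zone maps.
CREATE TABLE events_interleaved (
    user_id    BIGINT    NOT NULL,
    platform   VARCHAR(32),
    created_at TIMESTAMP NOT NULL
)
INTERLEAVED SORTKEY (user_id, platform, created_at);

-- Interleaved tables are re-sorted with VACUUM REINDEX, not plain VACUUM.
VACUUM REINDEX events_interleaved;
```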
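Nathan doesn't spell out the flattening, but one plausible reading is the sketch below: the Mongo array of strings is flattened into a single delimited text column, and "searching on it" is a pattern match. The sections table and subjects column are guesses, not the actual schema from the roundtable.

```sql
-- Hypothetical flattened schema: a Mongo array like
-- ["algebra", "geometry"] stored as one delimited string per row.
CREATE TABLE sections (
    id       BIGINT,
    subjects VARCHAR(1024)   -- e.g. 'algebra;geometry'
);

-- "Just search on it" with a pattern match:
SELECT id, subjects
FROM sections
WHERE subjects LIKE '%algebra%';
```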
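Finally, the workload Marc describes (timestamp sort keys, user-ID distribution keys, roll-ups by user and platform) might be shaped like the query below, run against the hypothetical events table from the first sketch. This is a guess at the pattern, not SQL from the roundtable.

```sql
-- Hypothetical roll-up in the shape Marc describes: aggregate raw
-- timestamped events by user and platform per day. Because rows arrive
-- in timestamp order, the created_at sort key stays naturally sorted.
SELECT
    user_id,
    platform,
    DATE_TRUNC('day', created_at) AS day,
    COUNT(*)                      AS event_count
FROM events
GROUP BY user_id, platform, DATE_TRUNC('day', created_at);
```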