CS634 – Homework 3
Released Feb 22, Due Mon., March 5
Being a DBA in a cloud VM, Indexing, Intro Query Evaluation
Problem 1 Mysql in the cloud VM.
Follow the instructions in MysqlInCloud.html to bring up your own mysql installation in the Google cloud. Send the email to eoneil as specified there. Show the "ls -l" output showing the contents of your data directory (like slide 8 of class08.) Also show "df" output and interpret it to find out how many GB of disk you have. Unfortunately we can't determine much about the physical disks from Linux. They are described in general here. Clearly we have restricted use of a shared RAID. Use "top" or /proc/meminfo to find out how much memory you have, and /proc/cpuinfo to find out how many processors, and how many cores you have. Report on your findings (you don't need to show details here).
Problem 2. User privileges and mysql
quirks.
a. As part of the installation of problem 1, you created a user named by your Linux username, and a database name <username>db in your cloud mysql. Now modify your cloud mysql user to have the same user privileges as your mysql user on pe07's mysql installation. Show the command(s) you used to get your old privileges from pe07's mysql and the commands you used to set up the cloud mysql user and database on your own cloud installation.
Note: Because your root user did "grant all privileges on joedb.* to joe@'%' identified by 'joe'" during your work in problem 1 (replace joe here with your own username), you will now need to start off by revoking that high level of privilege on your database by having your root user do "revoke all on joedb.* from joe@'%';"
b. In your cloud database, create tables named t1 and T1 to illustrate mysql's non-standard case-sensitive table names. Find and quote the line in pe07's mysql configuration that overrides this behavior, and try to create these two tables in your pe07 account. Report your findings. You can leave the default behavior--we need to know how to live with it.
c. Explore another unfortunate default mysql configuration: the default collation is caseless, so sorts (including sorts for index entries) are done with 'A' = 'a', etc. Prove this by creating a small two-column table T with a PK column of string type. You can use your own mysql installation or the database on pe07 for this, as they both have the default collation. The table T has of 5 or 6 rows, with different strings as PKs, with the strings chosen so that they have different orderings by case-sensitive vs. caseless comparison. Also a second column of ints. (Wrong:Note that mysql must allow 'x' and 'X' as different keys even though they sort the same.) Show "select * from T" output to see that the mysql PK is ordered by the caseless ordering (in mysql, unlike Oracle, the table is held in a clustered B+-tree index on the PK, scannable along the leaves, so this query shows the PK ordering.) Try your experiment on Oracle. In Oracle's case use "select id from T" where id is your PK. This way, Oracle will see that the cheapest way to evaluate this is to scan the PK index, thus revealing the key order. (For select * from T, it will use a scan of the heap file, not definitive here.) In your paper, show the queries and their output and your analysis.
Problem 3. Loading our standard tables.
a. Load the book's tables into your new installation using dbbook_mysql.sh for the tables in $cs634/tables, as in hw1. You'll need to edit ~/.my.cnf for this, and make the /var/lib/mysql-files directory with the input files (this requires sudo). Report success or problems. Note that this load does require the FILE privilege, which you should have granted to your mysql user as part of problem 2a.
b. Although these tables are too small to really merit indexing, pretend they are larger for now. Study the queries of Exercise 5.2 (the parts on hw1, now available in tables/queries.sql) and determine what indexes could speed these queries up. Any column used in a join is a good candidate. Also any column involved in a selection with equality or range condition. Create files add_indexes.sql and drop_indexes.sql to do these actions, and show them in your paper.
c. Run queries.sql before and after creating the indexes and running "analyze table" on the affected tables. Don't worry if it doesn't show improvement, since these tables are so small. Describe your timing results. Also compare with running these on pe07 (which should be a faster system).
Problem 4. Oracle vs. mysql
Oracle's yelp_db is finally loaded up, except for the tip, checkin, and photo tables. I was able to dump the mysql tables to CSV files (comma separated values) using mysqldbexport, after getting this program using "sudo app-get install mysql-utilities". Most tables were easily loaded using sqlldr from these CSV files. However, the review table is full of entries with newlines (the actual reviews), and Oracle's sqlldr insists on treating newlines as ends of records. So I wrote a mysql query to use string REPLACE on the review text to change newlines to spaces, and output values in CSV format. This worked except for about 16000 records out of 4.7 million, which still failed, so we have 0.3% fewer reviews in Oracle.
a. Let's compare Oracle and mysql for this sizable database. Run each query twice with timing. Although Oracle is using a RAID and mysql only one disk, they both have enough memory for the entire yelp_db database, so a second execution of a query should involve no disk reading. Both databases have the same indexes on FK columns as well as on PKs.
i. Find the number of 5-star reviews for state='NV'. Note that a few reviews failed to load into Oracle, so expect a slightly smaller number here for Oracle.
ii. Find the number of users (#unique user ids) reviewing businesses in 'NV'.
iii. Find the number of users with cool count over 10 reviewing businesses in 'NV'.
b. Write relational algebra for queries in part a (the plan for before the final count operator, since we don't have an RA operator for that) and draw draw possible query plans for these three queries, like Fig. 12.3. Note that the DISTINCT processing is considered part of the projection operator. Suggest three indexes that could speed up these queries.
c. Prove that the Oracle review table is not clustered by its primary key by executing and interpreting "select rowid from yelp_db.review where rownum<=20 order by id". See slide 29 of class04 for format of Oracle ROWIDs. Compare to "select rowid from setq_db.bench where rownum<=20 order by kseq". See Section 10.8.2 to understand how it should look if the data is loaded in PK order, as setq_db.bench is. (This lack of PK clustering is caused by mysql's use of a caseless comparison, so its PK order on these mixed-case keys is entirely different from Oracle's with the same data. The data was exported from mysql in this order and loaded into Oracle. Luckily we don't want to use range queries on review.id, where the clustering would really matter if the data needed to come from disk. This also can impact join performance as we will see in the future. We could of course dump-in-PK-order and reload the table.)
Problem 5. Linear Hashing
Problem 6. Index Matching
Exercise 12.4, parts 2, 3, 4
Problem 7. Sorting
Exercise 13.4 parts 1, 2a, 2d (2a uses single-block i/o, 2d uses multi-block i/o)