▼
20 years ago today, the TV show Mythbusters aired its first regular episode on the Discovery Channel, after three earlier pilot episodes. And soon after, I had found a new favorite TV show.
What Mythbusters did between 2003 and 2016 was investigate whether (urban) myths, viral videos, movie stunts and idioms had some truth to them or not. The usual outcomes were: confirmed (results replicated without issue), plausible (results replicated but possibly some liberties taken) and busted (results could not be reasonably duplicated). If a myth was busted, they would often see what it took to replicate the results, going to extremes as needed.
I think the first episode I saw was "penny drop". That was the seventh episode, with the main myth being that a penny (one cent coin) dropped from the Empire State Building would kill you if it hit you down on the street. The big question here was how fast a penny would be traveling after being dropped from the iconic skyscraper.
At this point, we only had the two main hosts, Jamie Hyneman and Adam Savage. Adam had built a rig with a tube with holes in it and air being blown in from the bottom with the airspeed in the tube going down with increasing height from 100 km/h to about 60 km/h. As the pennies stayed in the tube, their terminal velocity must match a speed between those two numbers.
With the always exuberant Adam annoyed by the usually stoic Jamie:
Jamie:
Works for me.
Adam:
This is excited Jamie? This is a world first going out on television that we showed exactly how fast a penny goes. No math, no ideas, no timing off a building, we got it right here in the lab.
Jamie:
It's great. Good job.
Turns out those falling pennies aren't lethal. Adam and Jamie both have one fired at their hand with no damage, and...
Adam:
Come on, shoot me in the ass, come on. I can take it.
The interaction between the hosts was what made Mythbusters so funny.
That dynamic wouldn't change very much over the following almost decade and a half. For most of the run of the show, the "build team" consisting of Grant Imahara, Kari Byron and Tony Belleci would usually bust separate myths, but sometimes the main and secondary hosts would team up.
They really caught lighning in a bottle here. The interactions between the hosts were almost always fun, and as experienced builders with some notion of the scientific method, the Mythbusters weren't completely lost when it came to attacking myths, but they were also not so knowledgeable that they could predict the results very well in advance.
A few more quotes:
Adam:
I reject your reality and substitute my own.
Jamie:
I love it when common materials are dangerous.
Narrator Robert Lee:
This thing that looks like a maximum security bird house, is actually the vital fulcrum.
Adam, about to go through a carwash with Jamie on the roof of his car:
As long as we're testing methods for removing unwanted pests from the outside of your car, we couldn't finish this without going through the carwash.
Tory:
It's not the prettiest thing but I think it's going to work.
Kari: That’s what they said when they hired me.
Great stuff.
Permalink - posted 2023-09-23
▼
After looking at the SQLite Unicode behavior, it's now time to do the same for MySQL. Coincidentally, I'm currently migrating some old databases that were created in the very early 2000s to a more modern environment. I think those old databases were from the MySQL 3.x days, before MySQL gained any sort of Unicode support. Those old tables are thus still in the latin1 (ISO 8859-1) character set.
For a long time that worked fine for me as these tables contain data in English and Dutch, and Dutch only needs a few accents, which are all present in Latin 1. However... at some point it started becoming more convenient to use UTF-8 for the text in those databases. So I did. I just never told MySQL about the switch.
In hindsight, that was not a terrible decision, as it avoided complexities at a time when UTF-8 support was still immature. But these days, there's really no excuse to do anything other than make an entire workflow UTF-8 clean, barring any showstoppers.
So I migrated an old table to the test environment for the new system. And got some really weird results: on web pages "CO₂" showed up as "COâ‚‚", but in MySQL it showed up correct, be it that the number of characters is off (I only got 6 while I specified 8):
mysql> select substr(article, 1587, 8) from muart where id = 753;
+--------------------------+
| substr(article, 1587, 8) |
+--------------------------+
| CO₂ ui |
+--------------------------+
Further digging by looking at the raw data:
mysql> select hex(substr(article, 1587, 8)) from muart where id = 753;
+-------------------------------+
| hex(substr(article, 1587, 8)) |
+-------------------------------+
| 434FC3A2E2809AE2809A207569 |
+-------------------------------+
Fortunately, it's not necessary to decode this manually, there is an UTF-8 Decoder web tool for that. The decoder shows that COâ‚‚ is correct. So why is MySQL showing me CO₂? That's a problem I hadn't heard of before, but is apparently not uncommon:
Double-encoded UTF-8.
This happens when you take UTF-8, such as "CO₂", and then pretend it's another encoding (usually Latin 1) and convert that to UTF-8. So what had happened is that as I was importing my data in MySQL, the MySQL command line client would send UTF-8 to the MySQL server process, but the server would think that was Latin 1 and convert it to UTF-8. Then when I did a query, the server would convert the UTF-8 back to what it thought was Latin 1 for the convenience of the client, and the client then showed this as UTF-8 so everything looked good, but was actually stored in the database incorrectly.
Fortunately, the two related problems were both easy enough to fix. First, make sure the client and server agree on the character encoding. Let's first check what the server's original settings are:
mysql> SHOW SESSION VARIABLES LIKE 'character_set%';
+--------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
Wow. But fortunately we don't have to fix all of those individually. We can simply do:
mysql --default_character_set=utf8mb4
And then:
mysql> SHOW SESSION VARIABLES LIKE 'character_set%';
+--------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
Much better.
And convert the double-encoded UTF-8 into something more readable:
mysql> UPDATE muart SET title=CONVERT(CAST(CONVERT(title USING latin1) AS binary) USING utf8mb4);
Keen-eyed observers may have noted that MySQL has two versions of UTF-8: utf8mb3 and utf8mb4. This goes back to the early days of Unicode, where the idea was that all characters would fit into 16 bits. That results in a maximum of 3 bytes of UTF-8. But it soon became clear that 16 bits wasn't enough. So now it's 21 bits. UTF-8 can handle that just fine, but those characters that need 17 - 21 bits result in 4-byte UTF-8 sequences. So when dealing with MySQL, when you think "UTF-8", type "utf8mb4".
Permalink - posted 2023-09-21
https://en.wikipedia.org/wiki/European_ordering_rules
But as the transition from 8-bit characters to 16-bit characters was already well underway in the Windows world and in languages such as Java and Javascript,
# mysql --default_character_set=utf8mb4 muada
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or g.
Your MySQL connection id is 98
Server version: 8.0.34 MySQL Community Server - GPL
Copyright (c) 2000, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
mysql> SHOW SESSION VARIABLES LIKE 'character_set%';
+--------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
DUCET
https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
https://www.unicode.org/reports/tr10/
http://www.unicode.org/Public/UCA/latest/allkeys.txt
https://en.wikipedia.org/wiki/European_ordering_rules
https://en.wikipedia.org/wiki/ISO/IEC_14651
mysql> insert into test (id, text) values (5, concat('Zo', char(12*16 + 3, 10*16 + 11 using utf8mb4)));
Query OK, 1 row affected (0.00 sec)
mysql> insert into test (id, text) values (6, concat('Zoe', char(12*16 + 12, 8*16 + 8 using utf8mb4)));
mysql> insert into test (id, text) values (7, 'Zoe');
mysql> select * from test order by text collate utf8mb4_bin;
+----+-------+
| id | text |
+----+-------+
| 1 | ZOE |
| 4 | ZoA |
| 7 | Zoe |
| 6 | Zoë |
| 3 | Zoy |
| 5 | Zoë |
| 2 | zoe |
+----+-------+
7 rows in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_bin) as sortweight from test order by
text colla
+----+-------+----------------------------+
| id | text | sortweight |
+----+-------+----------------------------+
| 1 | ZOE | 0x00005A00004F000045 |
| 4 | ZoA | 0x00005A00006F000041 |
| 7 | Zoe | 0x00005A00006F000065 |
| 6 | Zoë | 0x00005A00006F000065000308 |
| 3 | Zoy | 0x00005A00006F000079 |
| 5 | Zoë | 0x00005A00006F0000EB |
| 2 | zoe | 0x00007A00006F000065 |
+----+-------+----------------------------+
mysql> select *, weight_string(text collate utf8mb4_0900_ai_ci) as sortweight from test order by text collate utf8mb4_0900_ai_ci;
+----+-------+------------------------+
| id | text | sortweight |
+----+-------+------------------------+
| 4 | ZoA | 0x1F211DDD1C47 |
| 1 | ZOE | 0x1F211DDD1CAA |
| 2 | zoe | 0x1F211DDD1CAA |
| 5 | Zoë | 0x1F211DDD1CAA |
| 6 | Zoë | 0x1F211DDD1CAA |
| 7 | Zoe | 0x1F211DDD1CAA |
| 3 | Zoy | 0x1F211DDD1F0B |
+----+-------+------------------------+
mysql> select *, weight_string(text collate utf8mb4_0900_ai_ci) as sortweight from test order by text collate utf8mb4_0900_ai_ci desc;
+----+-------+------------------------+
| id | text | sortweight |
+----+-------+------------------------+
| 3 | Zoy | 0x1F211DDD1F0B |
| 1 | ZOE | 0x1F211DDD1CAA |
| 2 | zoe | 0x1F211DDD1CAA |
| 5 | Zoë | 0x1F211DDD1CAA |
| 6 | Zoë | 0x1F211DDD1CAA |
| 7 | Zoe | 0x1F211DDD1CAA |
| 4 | ZoA | 0x1F211DDD1C47 |
+----+-------+------------------------+
mysql> select *, weight_string(text collate utf8mb4_0900_as_ci) as sortweight from test order by text collate utf8mb4_0900_as_ci;
+----+-------+------------------------------------+
| id | text | sortweight |
+----+-------+------------------------------------+
| 4 | ZoA | 0x1F211DDD1C470000002000200020 |
| 1 | ZOE | 0x1F211DDD1CAA0000002000200020 |
| 2 | zoe | 0x1F211DDD1CAA0000002000200020 |
| 7 | Zoe | 0x1F211DDD1CAA0000002000200020 |
| 5 | Zoë | 0x1F211DDD1CAA0000002000200020002B |
| 6 | Zoë | 0x1F211DDD1CAA0000002000200020002B |
| 3 | Zoy | 0x1F211DDD1F0B0000002000200020 |
+----+-------+------------------------------------+
mysql> select *, weight_string(text collate utf8mb4_0900_as_cs) as sortweight from test order by text collate utf8mb4_0900_as_cs;
+----+-------+--------------------------------------------------------+
| id | text | sortweight |
+----+-------+--------------------------------------------------------+
| 4 | ZoA | 0x1F211DDD1C4700000020002000200000000800020008 |
| 2 | zoe | 0x1F211DDD1CAA00000020002000200000000200020002 |
| 7 | Zoe | 0x1F211DDD1CAA00000020002000200000000800020002 |
| 1 | ZOE | 0x1F211DDD1CAA00000020002000200000000800080008 |
| 5 | Zoë | 0x1F211DDD1CAA0000002000200020002B00000008000200020002 |
| 6 | Zoë | 0x1F211DDD1CAA0000002000200020002B00000008000200020002 |
| 3 | Zoy | 0x1F211DDD1F0B00000020002000200000000800020002 |
+----+-------+--------------------------------------------------------+
ê
mysql> select *, weight_string(text collate utf8mb4_0900_bin) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 10 using utf8mb4)) collate utf8mb4_0900_bin;
Empty set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_as_cs) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 10 using utf8mb4)) collate utf8mb4_0900_as_cs;
Empty set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_ai_cs) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 10 using utf8mb4)) collate utf8mb4_0900_ai_cs;
ERROR 1273 (HY000): Unknown collation: 'utf8mb4_0900_ai_cs'
mysql> select *, weight_string(text collate utf8mb4_0900_ai_ci) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 10 using utf8mb4)) collate utf8mb4_0900_ai_ci;
+----+-------+------------------------+
| id | text | sortweight |
+----+-------+------------------------+
| 1 | ZOE | 0x1F211DDD1CAA |
| 2 | zoe | 0x1F211DDD1CAA |
| 5 | Zoë | 0x1F211DDD1CAA |
| 6 | Zoë | 0x1F211DDD1CAA |
| 7 | Zoe | 0x1F211DDD1CAA |
+----+-------+------------------------+
5 rows in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_ai_ci) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 10 using utf8mb4)) collate utf8mb4_0900_ai_ci order by text utfmb4_0900_as_cs;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'utfmb4_0900_as_cs' at line 1
mysql> select *, weight_string(text collate utf8mb4_0900_ai_ci) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 10 using utf8mb4)) collate utf8mb4_0900_ai_ci order by text collate utfmb4_0900_as_cs;
ERROR 1273 (HY000): Unknown collation: 'utfmb4_0900_as_cs'
mysql> select *, weight_string(text collate utf8mb4_0900_ai_ci) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 10 using utf8mb4)) collate utf8mb4_0900_ai_ci order by text collate utf8mb4_0900_as_cs;
+----+-------+------------------------+
| id | text | sortweight |
+----+-------+------------------------+
| 2 | zoe | 0x1F211DDD1CAA |
| 7 | Zoe | 0x1F211DDD1CAA |
| 1 | ZOE | 0x1F211DDD1CAA |
| 5 | Zoë | 0x1F211DDD1CAA |
| 6 | Zoë | 0x1F211DDD1CAA |
+----+-------+------------------------+
5 rows in set (0.00 sec)
mysql>
mysql>
mysql>
mysql>
mysql> select *, weight_string(text collate utf8mb4_0900_bin) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 11 using utf8mb4)) collate utf8mb4_0900_bin;
+----+------+------------------------+
| id | text | sortweight |
+----+------+------------------------+
| 5 | Zoë | 0x5A6FC3AB |
+----+------+------------------------+
1 row in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_as_cs) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 11 using utf8mb4)) collate utf8mb4_0900_as_cs;
+----+-------+--------------------------------------------------------+
| id | text | sortweight |
+----+-------+--------------------------------------------------------+
| 5 | Zoë | 0x1F211DDD1CAA0000002000200020002B00000008000200020002 |
| 6 | Zoë | 0x1F211DDD1CAA0000002000200020002B00000008000200020002 |
+----+-------+--------------------------------------------------------+
2 rows in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_ai_cs) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 11 using utf8mb4)) collate utf8mb4_0900_ai_cs;
ERROR 1273 (HY000): Unknown collation: 'utf8mb4_0900_ai_cs'
mysql>
mysql>
mysql>
mysql>
mysql> select *, weight_string(text collate utf8mb4_0900_bin) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 11 using utf8mb4)) collate utf8mb4_0900_bin;
+----+------+------------------------+
| id | text | sortweight |
+----+------+------------------------+
| 5 | Zoë | 0x5A6FC3AB |
+----+------+------------------------+
1 row in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_as_ci) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 11 using utf8mb4)) collate utf8mb4_0900_as_ci;
+----+-------+------------------------------------+
| id | text | sortweight |
+----+-------+------------------------------------+
| 5 | Zoë | 0x1F211DDD1CAA0000002000200020002B |
| 6 | Zoë | 0x1F211DDD1CAA0000002000200020002B |
+----+-------+------------------------------------+
2 rows in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_as_cs) as sortweight from test where text = concat('Zo', char(12*16 + 3, 10*16 + 11 using utf8mb4)) collate utf8mb4_0900_as_cs;
+----+-------+--------------------------------------------------------+
| id | text | sortweight |
+----+-------+--------------------------------------------------------+
| 5 | Zoë | 0x1F211DDD1CAA0000002000200020002B00000008000200020002 |
| 6 | Zoë | 0x1F211DDD1CAA0000002000200020002B00000008000200020002 |
+----+-------+--------------------------------------------------------+
2 rows in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_bin) as sortweight from test where text = 'zoe' collate utf8mb4_0900_bin;
+----+------+------------------------+
| id | text | sortweight |
+----+------+------------------------+
| 2 | zoe | 0x7A6F65 |
+----+------+------------------------+
1 row in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_as_cs) as sortweight from test w
here text = 'zoe' collate utf8mb4_0900_as_cs;
+----+------+------------------------------------------------+
| id | text | sortweight |
+----+------+------------------------------------------------+
| 2 | zoe | 0x1F211DDD1CAA00000020002000200000000200020002 |
+----+------+------------------------------------------------+
1 row in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_as_ci) as sortweight from test where text = 'zoe' collate utf8mb4_0900_as_ci;
+----+------+--------------------------------+
| id | text | sortweight |
+----+------+--------------------------------+
| 1 | ZOE | 0x1F211DDD1CAA0000002000200020 |
| 2 | zoe | 0x1F211DDD1CAA0000002000200020 |
| 7 | Zoe | 0x1F211DDD1CAA0000002000200020 |
+----+------+--------------------------------+
3 rows in set (0.00 sec)
mysql> select *, weight_string(text collate utf8mb4_0900_ai_ci) as sortweight from test where text = 'zoe' collate utf8mb4_0900_ai_ci;
+----+-------+------------------------+
| id | text | sortweight |
+----+-------+------------------------+
| 1 | ZOE | 0x1F211DDD1CAA |
| 2 | zoe | 0x1F211DDD1CAA |
| 5 | Zoë | 0x1F211DDD1CAA |
| 6 | Zoë | 0x1F211DDD1CAA |
| 7 | Zoe | 0x1F211DDD1CAA |
+----+-------+------------------------+
mysql> show collation like '%utf8mb4_uni%';
+------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_unicode_520_ci | utf8mb4 | 246 | | Yes | 8 | PAD SPACE |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE |
+------------------------+---------+-----+---------+----------+---------+---------------+
mysql> show collation like '%utf8mb4_0900%';
+--------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
| utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD |
| utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD |
| utf8mb4_0900_bin | utf8mb4 | 309 | | Yes | 1 | NO PAD |
+--------------------+---------+-----+---------+----------+---------+---------------+
Permalink - posted 2023-09-16
▼
The other day, I landed on this article: In Focus: Subsea Network Architecture: IXPs. The article takes some time to arrive at the point that undersea internet exchanges would be a good idea. The most eyecatching part is a variation on this image:
As the article starts out discussing how datacenters have been moving away from large cities to take advantage of opportunities such as space, cheap energy and easier cooling, this image seems to suggest that these blue dots in are good locations for datacenters and/or internet exchanges in general. And that's definitely not the point of the
paper
that the image is from.
That paper is very specifically about the best locations to place servers for high speed algorithmic trading on multiple markets some distance away from each other. This immediately explains why there is nothing around the western US: there are simply no stock exchanges / markets there (the red dots in the image).
The math looks more complicated, but presumably, in these cases it helps when the servers executing the trading algorithms are in the middle between the "users", rather than close to one and further from the other(s).
If you need data from two places far away from each other, then it's better when each is 25 milliseconds away, as you can then complete your action in 25 ms plus however long it takes to do your own processing. If you're close to one so it's 0 ms for one data source and 50 ms for the other, then the entire action takes at least 50 ms.
But is that a common situation?
In general, you can just copy the data beforehand. So this only applies if you're using "live" data from two or more locations. Videoconferencing with a number of participants could be an example, where a server receives the video from all the participants, mixes it into a single feed and then sends that single feed out to all the participants. If the server is in the middle, this limits the maximum delay. I guess that could be somewhat helpful. But to the degree that it makes sense to have datacenters in the middle of the ocean? I'm not convinced.
Permalink - posted 2023-09-07
▼
In this post, I want to have a look at how SQLite interacts with Unicode. (Also see my post The (dark) magic of Unicode.) As explained here, SQLite doesn't have full Unicode support unless that support is explicitly included when SQLite is compiled.
So what does this mean in practice?
On the (non-Windows) systems I have access to, SQLite uses UTF-8 to store text. UTF-16 is also supported. However, a limitation that's made very explicit is that changing from upper case to lower case or the other way around is only supported for 7-bit ASCII. So the 26 letters we know from the English alphabet.
Let's see for ourselves:
% sqlite3 test.db
sqlite> .mode qbox
sqlite> create table test (id integer primary key, text text);
sqlite> insert into test (text) values (format('SM%sRG%sSBORD', 'O', 'A'));
sqlite> insert into test (text) values (format('SM%sRG%sSBORD', char(0xD6), char(0xC5)));
sqlite> insert into test (text) values (format('SM%sRG%sSBORD', 'O'||char(0x308), 'A'||char(0x30A)));
sqlite> insert into test (text) values (format('SM%sRG%sSBORD', 'U', 'A'));
We've now added four variations of the word "SMORGASBORD" to our test database table. And that worked:
sqlite> select * from test;
┌────┬─────────────────┐
│ id │ text │
├────┼─────────────────┤
│ 1 │ 'SMORGASBORD' │
│ 2 │ 'SMÖRGÅSBORD' │
│ 3 │ 'SMÖRGÅSBORD' │
│ 4 │ 'SMURGASBORD' │
└────┴─────────────────┘
The first version uses an unaccented O and A. The second version uses the "composed" version of the accented letters. That means that a single Unicode character / code point encodes the letter/accent combination. The third version uses the regular letters followed by a "combining diacritical mark", a special Unicode character that adds an accent to the previous character. And then finally I added "SMURGASBORD" which will come in handy later.
(The combining diacritical marks throw off the SQLite text length / box size calculations, hence the misplaced vertical line at the end of row 3.)
Those upper case letters are a bit overbearing, though, so let's convert them to lower case.
sqlite> update test set text = lower(text);
sqlite> select * from test;
┌────┬─────────────────┐
│ id │ text │
├────┼─────────────────┤
│ 1 │ 'smorgasbord' │
│ 2 │ 'smÖrgÅsbord' │
│ 3 │ 'smörgåsbord' │
│ 4 │ 'smurgasbord' │
└────┴─────────────────┘
And here SQLite's lower()/upper() limitation pops up. The first and last rows are converted without trouble because those only use the letters out of the regular Latin script alphabet. The second one uses letter/accent combinations outside of the 7-bit ASCII range so those letters are not converted to lower case. Interestingly, the third row did get converted. That's because the letter part is encoded using the regular O and A, which SQLite can convert to lower case. The diacritical combing marks that add the accents then apply to the now lower case o and a without missing a beat.
So the take home message is: don't let SQLite convert between upper and lower case.
In any event, the rules for how to do this properly vary by language. For instance, an English speaker may think the following makes sense:
istanbul = Istanbul
ijsbeer = Ijsbeer
één = Één
But that would be wrong. In Turkish, there is a dotted i and an un-dotted i, the ı. Ok, that's not so bad. Until you realize that i becomes İ and ı becomes I. And in Dutch an "ijsbeer" (polar (ice) bear) needs to become "IJsbeer" at the start of a sentence because IJ is considered a single letter. And, unlike for instance the French, we don't put accents on capital letters, so "één" (one) at the beginning of a sentence becomes "Eén". So these are correct:
istanbul = İstanbul
ijsbeer = IJsbeer
één = Eén
Also, because the composed and decomposed variations of smörgåsbord use different sequences of Unicode characters, if you compare the same word with those different encodings, they won't match. But that part can be solved by normalizing your Unicode string to either composed or decomposed, and then comparisons will work.
But then there's sorting to worry about.
sqlite> select id, length(text) as chars, octet_length(text) bytes, text from test order by text;
┌────┬───────┬───────┬─────────────────┐
│ id │ chars │ bytes │ text │
├────┼───────┼───────┼─────────────────┤
│ 1 │ 11 │ 11 │ 'smorgasbord' │
│ 3 │ 13 │ 15 │ 'smörgåsbord' │
│ 4 │ 11 │ 11 │ 'smurgasbord' │
│ 2 │ 11 │ 13 │ 'smÖrgÅsbord' │
└────┴───────┴───────┴─────────────────┘
(We can see here that "length" shows the number of Unicode code points. "octet_length" is the length in bytes, which is different as soon as a text contains anything that isn't 7-bit ASCII.)
Considering what we discussed before, this makes perfect sense: "smo" and "smo" come before "smu" (the fact that the second "smo" is followed by a diacritical combining mark is not relevant when looking at the first three characters) and then "smÖ" comes after "smu" because Ö is a "high-ASCII" character while "u" is just a regular 7-bit ASCII character.
Of course this is not really a correct sorting, as that again depends on the language/locale. In some languages, letters with accents are considered letters in their own right with a specific place in the sorting order, while in other languages, the accent is simply ignored and äbc comes between abb and abd. And that's just for "general" sorting.
Name/phonebook sorting often has its own rules. For instance, so many people here in the Netherlands have a name starting with "van" (of) or "van de" (of the) that these prefixes are generally ignored for name sorting. And many names have an "ij" and a "y" form. So "van Beijnum" is found under B, and is sorted together with "van Beynum".
So these caveats are something to be aware of. But there is an escape hatch: you can compile SQLite with SQLITE_ENABLE_ICU to gain full Unicode support. But this makes SQLite a good deal bigger and slower. Dilemmas...
Permalink - posted 2023-09-05
▼
When I was 24, I decided to give up my job and go to college and study computer science. If I'd have known how many database classes that involved, maybe I would have reconsidered.
Back then, we had a big server that ran a RDBMS (relational database management system) that hundreds of students all used together. These systems were big, complex and expensive. (Oracle made its fortune selling RDBMSes.) MySQL and PostgresQL are somewhat more streamlined free and open source RDBMSes. Much better, but firewalling, user authentication and backups are still somewhat of a headache. But hey, if you need a database, you need a database.
Enter SQLite.
Traditional RDBMSes are a service that you connect to over the network—even if the RDBMS runs on the local machine. SQLite, on the other hand, is just a (pretty small) library that you link to in your software, and all the database contents go into a single file stored on the local file system. But you still get a very complete implementation of SQL, the "structured query language" that is used to interact with most databases.
I find myself dealing with CSV files and the like pretty regularly. For just getting data in and out of applications, this is usually fine. But doing anything useful with CSV data in your own scripts or applications is a chore. With SQL, on the other hand, you can easily analyse and transform data, often with pretty simple SQL queries.
Compared to MySQL
In the past, I've installed MySQL on my laptop to some data analysis. Having to install a big piece of software and configuring a user account and permissions is less than ideal. With SQLite you just point to a local file and you're in business. If you know what you're doing, SQLite can also be a lot faster than many other database engines such as MySQL. (Although without grouping large numbers of insert statements into larger transactions SQLite will be slow.)
Se the page Appropriate Uses For SQLite on the SQLite website to learn when SQLite is a good fit and when it isn't. The main thing is that SQLite is not a good fit when the application needs to access the database over a network.
Using SQLite with PHP
Initially, it seemed that moving from MySQL to SQLite in PHP would be tricky. However, the basics are no issue at all. This is the simple way to use MySQL in PHP:
$db = mysqli_connect("localhost", "db_user", "secretpw", "db_name");
$result = $db->query("select * from pages where pagenum = $n");
while ($row = $result->fetch_assoc())
echo json_encode($row);
$db->close();
That translates relatively neatly into SQLite:
$db = new SQLite3("content.db");
$result = $db->query("select * from pages where pagenum = $n");
if ($row = $result->fetchArray())
echo json_encode($row);
$db->close();
The C API requires a few more steps to get things done, but doesn't look excessively complicated.
However, MySQL and SQLite use different SQL dialects. So far, my main issue has been the missing left() and right() functions and the handling of dates/times. Especially adjusting a given date is done in rather different ways.
SQLite from the command line
You can directly interact with SQLite databases through the sqlite3 program that is installed by default on a good number of operating systems. You can download the Windows version from the sqlite.org version. And there's even an Amiga port. Sweet. Just type "sqlite3" followed by the filename of a database file and you're in business. SQLite 3 is open source, and interestingly, has no license: it's entirely in the public domain.
A features in SQLite that I really like is that you can read blobs from the file system in order to store them in a database table, and then later write those back to the file system:
insert into blobstore (id, data) values (1, readfile('Makefile'));
select writefile('blob.bin', data) from blobstore where id = 1;
And another great feature of the command line sqlite3 program is that you can just call an editor to edit the contents of a field. This is especially useful if you store larger texts in a database. It works like this:
% sqlite3 test.db
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> update blobstore set data = edit(data, 'vi') where id = 1;
(Of course in general you'd probably edit text columns, not blob columns.)
Portability and easy backups
As mentioned before, SQLite stores an entire database, holding one or more tables and all the associated housekeeping, in a single file. So you can just copy that file to another computer, email it to someone, or use standard backup mechanisms to back it up. (Ok, it seems that backing up the database file while changes are being made is not 100% bullet proof, but it's close.)
The SQLite3 file format has been backward compatible since 2004, and it's 32/64-bit and big-endian/little-endian agnostic. This means you can just email a copy of a SQLite database to someone else, and they will be able to use it without trouble. And you can be fairly confident that several decades from now, it's still possible to retrieve data from an SQLite file.
SQLite has some interesting Unicode caveats, but I'll save those for a later post.
Permalink - posted 2023-09-04