In our first blog, we used a regular expression to replace the quotes in genres
. Afterward, we were able to UNNEST() the JSON object. We’ll be working with the same data set in this blog
In our data:
Embedded content: https://gist.github.com/nfarah86/ef1cc9da88e56226c4c46fd0e3c8e16e
there is a JSON string that’s called spoken_languages, and it’s formatted similarly to genres:
[ { "spoken_languages": "[{'iso_639_1': 'fr', 'name': 'Français'}]" }]
Assuming everything is consistent, we can just write the SQL statement similar to what we wrote for genres
— right?
Wrong ⛔️
We actually get a parsing error:
json parse error on line 0 near `x9akai”}, {“iso_’: unknown escape 120 in string
So, we sort of know what the culprit is here, but we don’t know if there are going to be more parsing errors. Let’s go ahead and debug this, so we can get this SQL statement working!
We can actually use REGEXP_LIKE() to see what exactly is causing the error:
SELECT
spoken_languages
FROM
commons.TwtichMovies t
WHERE REGEXP_LIKE(t.spoken_languages,'x9akai')
This is a sample of what we get back:
This is great— it looks like the \ are actually causing some issues. We can use REGEXP_REPLACE() to replace those slashes with an empty string:
SELECT (REGEXP_REPLACE('Lietuvi\\x9akai', '\\',''));
Now— to put it all together: How do we UNNEST() spoken_languages and fix the 2 issues at hand (the string format and the double slashes)?
Hint:
SELCT REGEXP_REPLACE(REGEXP_REPLACE('[{''iso_639_1'': ''lt'', ''name'':''Lietuvi\\x9akai''}]', '''', '"'), '\\', '');
I’m sure you can take it to the finish line from here, but just in case, you can watch this youtube link down below to catch the full replay!
Embedded content: https://youtu.be/8aHgJrQjT4U
TLDR: you can find all the resources you need in the developer corner.