Nusret Ozates

My summary of the “Career Advice in AI” Lecture

Nusret Ozates — Fri, 19 Dec 2025 21:00:00 GMT

I finally had time to watch the AI Career Advice lesson by Andrew Ng and Laurence Moroney. It was a great lesson for me as I’m about to graduate from my MSc and am ready to join the industry again! I’ve created a summary and wanted to share it with everyone, but I strongly recommend watching it.

In his introduction, Andrew Ng makes two important points. The first is that, although some people consider it politically incorrect to say so, working hard is important for success. There are certainly exceptions, such as when you have an injury or have just had a kid. The second is about surrounding yourself with bright, high-quality people in both your personal life and your work life, because you are the average of your surroundings. Choosing whom you work with is more important than where you work. Additionally, he mentioned that AI makes engineers faster, but the ones who also listen to user feedback and communicate with other people will be the fastest.

Laurence Moroney makes crucial and thoughtful additions to these points. About hard work, he said hard work ≠ amount of time spent. It must be something measurable: what is your output after those hours? X new products? Y papers read and properly understood? About your surroundings, he reminded us that those people also choose whether they want you around. Even if you are a 10x engineer, if you are rude, people won’t want to work with you. After these additions, he talked about the 3 pillars of success that you need to show employers, not just tell them about.

Understanding Depth

Surface-level knowledge is not enough anymore. You need academic knowledge, diverse skills, and the ability to separate noise from real trends; engagement, not accuracy, is the currency of social media, so there will be a lot of noise there. By diverse skills, he doesn’t mean knowing both NLP and CV; he means knowing how to train ML models while also knowing how to deploy them, scale them, and build an application on top of them, so that you stay valuable even if the AI hype completely deflates tomorrow. He also gave a practical strategy for filtering noise: develop trusted sources and filter them actively. Learn the fundamentals of the hyped tech (this is where academic knowledge comes in) before judging its impact (e.g., “Hollywood is over”, “SWE is over”). Always be aware of the trends and know why something is a trend right now. As an example, take “AI Agents”: before jumping into implementation, understand how they work, “when” and “why” they add value, and when they won’t.

Business Focus

You need to translate the capabilities of AI into a real business outcome. If you go directly with the hype (agents, these days), you will fail. First, you need to peel apart the business requirements and ask “why” and “what” many times to understand the real bottleneck/problem. Additionally, the risks of mispredictions, hallucinations, biases, and misuse (users will find edge cases you would never think of) are here to stay. Knowing how to manage those risks while turning a process into an AI-enabled one is critical.

Bias Towards Delivery

Building cool things that have no value is not as important as it was when the hype started. You need to build useful things; if something is both useful and cool, even better. Show that you can ship working solutions, not just demos. An example from him: before applying to Google Cloud, while he was writing a Java book, he built a Java application that runs on Google Cloud and showed it in the interview. That turned the interview into questions about his app instead of weird questions like how many windows there are in New York.

And some additional points:

You will make mistakes, so learn from them and also be helpful when someone else makes a mistake
Vibe coding is good unless you mindlessly copy-paste the code. Every time you use AI to generate code, you are taking on technical debt.
Learning how to fine-tune those small LLMs is currently one of the most important things, due to privacy reasons in a lot of industries

If you want to watch:

PyTorch Geometric Basics: How Message Passing Works

Nusret Ozates — Fri, 17 Oct 2025 21:00:00 GMT

I’m working with GNNs for my MSc thesis and naturally chose PyTorch Geometric (PyG), one of the most popular libraries in the field. While PyG is incredibly easy to use, I realized I needed to understand how message passing works under the hood to effectively customize it for my specific experiments. Now that I’ve gained this understanding, I’ve written this post to share the inner workings of PyG with anyone else looking to build their own custom layers.

Introduction to Message Passing in GNNs

For those unfamiliar, Graph Neural Networks (GNNs) are a class of neural networks designed to operate on graph-structured data. They leverage the relationships between nodes (entities) and edges (connections) to learn representations that capture both local and global graph structures. Message passing is a fundamental operation in GNNs, where information is exchanged between nodes and their neighbors to update node representations.
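As a compact reference, a single message-passing layer is often written in the form below. The notation follows the general scheme used in the PyG documentation; the symbols are introduced here for illustration and are not part of the original post.

```latex
\mathbf{x}_i' = \gamma\left(\mathbf{x}_i,\ \bigoplus_{j \in \mathcal{N}(i)} \phi\left(\mathbf{x}_i, \mathbf{x}_j, \mathbf{e}_{j,i}\right)\right)
```

Here \phi builds one message per edge, \bigoplus is a permutation-invariant reduction such as sum, mean, or max over the neighbors \mathcal{N}(i), and \gamma combines the result with the node's own features. These three pieces correspond to the message, aggregate, and update methods described in the rest of this post.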

Note

PyG uses “source_to_target” flow by default, meaning messages are sent from source nodes to target nodes. Source nodes are typically denoted with a subscript “_j” and target nodes with “_i”. You can remember it like: Source = neighbors, Target = self.
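To make the source/target convention concrete, here is a minimal NumPy sketch (not PyG code) of what "source_to_target" means for a tiny graph. The graph and the feature values are made up for illustration.

```python
import numpy as np

# Toy graph with 3 nodes and directed edges 0->1, 1->0, 1->2, 2->1.
# Row 0 holds the source nodes (j), row 1 holds the target nodes (i).
edge_index = np.array([[0, 1, 1, 2],
                       [1, 0, 2, 1]])
x = np.array([[1.0], [10.0], [100.0]])  # one feature per node, shape [N, 1]

source, target = edge_index
x_j = x[source]  # features of the sending (neighbor) nodes, shape [E, 1]
x_i = x[target]  # features of the receiving (self) nodes, shape [E, 1]

# Sum every message into the row of its target node.
out = np.zeros_like(x)
np.add.at(out, target, x_j)

print(out.ravel())  # [ 10. 101.  10.]
```

Node 1 receives messages from nodes 0 and 2 (1 + 100), while nodes 0 and 2 each receive only the message from node 1.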

In PyG, message passing is typically implemented using the MessagePassing class, which provides a flexible framework for defining custom message-passing schemes. It has 4 important methods:

1. Propagate

This function orchestrates the message-passing process. Its one mandatory parameter is the edge index, a sparse representation of the adjacency matrix with shape [2, E] that lists the source and target node of every edge. You will almost always also pass the feature matrix x. In addition, you can pass any other data needed in the later steps. You don’t override this function; you call it with the necessary data, which is then forwarded to the next steps.
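The relationship between a dense adjacency matrix and an edge index can be sketched in NumPy as follows. The orientation adj[s, t] = 1 for an edge s -> t is an assumption chosen here to match the default "source_to_target" flow.

```python
import numpy as np

# Dense adjacency matrix: adj[s, t] = 1 means there is an edge s -> t.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])

# Keep only the coordinates of the non-zero entries (COO format).
sources, targets = np.nonzero(adj)
edge_index = np.stack([sources, targets])  # shape [2, E]

print(edge_index)
```

For sparse graphs this stores 2 * E integers instead of N * N entries, which is why the edge index is the usual input format.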

Some example parameters:

propagate(edge_index, x=x)
propagate(edge_index, x=x, edge_attr=edge_attr)
propagate(edge_index, x=x, norm=norm)

Important

Choosing the name for the feature vector x is critical. If you use x=x like the example above, PyG will automatically split it into x_i and x_j for target and source nodes, respectively. If you use feature_vec=x, you should use feature_vec_i and feature_vec_j in the later steps’ parameter names.

2. Message

This is where you create a message for the target node from each of its neighboring (source) nodes. By default, this function takes x_j as input, which holds the feature vectors of the source nodes. This means you have to pass your feature matrix as x=x in the propagate function, or change the parameter name in the message function to match the name you used.

You can also access any other data you passed to the propagate function, such as edge attributes or normalization factors. For example, in the third propagate example above, you pass a norm parameter; you can then access it in the message function as norm.

Some example parameters:

message(x_j)
message(x_j, norm)
message(x_j, x_i, norm, edge_index, x)

Also, an example implementation that normalizes the messages by their node’s degree:

def message(self, x_j, norm):
    # x_j has shape [E, out_channels]
    print("Creating messages...")
    print(f"x_j shape: {x_j.shape}")
    print(f"norm shape: {norm.shape}")

    # Step 4: Normalize node features.
    return norm.view(-1, 1) * x_j

3. Aggregate

Now that you have messages from your neighboring nodes, this is where you aggregate them. By default, this method calls the class’s aggregator, which performs a sum (“add”). You can change it to “mean” or “max” when you initialize your custom MessagePassing class.

You can also override this method to implement your own aggregation logic. For example, you could weight the messages using any data you want before handing them to the default sum aggregation.

It takes the following parameters:

inputs which is the messages created in the message function
index that says the target node each message belongs to

And whatever you want from the propagate function.
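The three built-in reductions mentioned above can be sketched in NumPy like this. The function below is a hypothetical stand-in written for this post, not PyG's implementation; it only shows how inputs and index work together.

```python
import numpy as np

def aggregate(inputs, index, num_nodes, aggr="sum"):
    """Reduce per-edge messages [E, F] into per-node rows [N, F] (NumPy sketch)."""
    feat_dim = inputs.shape[1]
    if aggr in ("sum", "mean"):
        out = np.zeros((num_nodes, feat_dim))
        np.add.at(out, index, inputs)  # unbuffered add: repeated indices accumulate
        if aggr == "mean":
            counts = np.bincount(index, minlength=num_nodes).reshape(-1, 1)
            out = out / np.maximum(counts, 1)  # avoid dividing by zero for isolated nodes
        return out
    if aggr == "max":
        out = np.full((num_nodes, feat_dim), -np.inf)
        np.maximum.at(out, index, inputs)
        out[np.isinf(out)] = 0.0  # nodes without messages get 0 in this sketch
        return out
    raise ValueError(f"unknown aggregation: {aggr}")

messages = np.array([[1.0], [2.0], [4.0], [8.0]])  # one message per edge
index = np.array([1, 0, 1, 1])                      # target node of each message

sum_out = aggregate(messages, index, num_nodes=3, aggr="sum")
mean_out = aggregate(messages, index, num_nodes=3, aggr="mean")
max_out = aggregate(messages, index, num_nodes=3, aggr="max")
print(sum_out.ravel(), mean_out.ravel(), max_out.ravel())
```

Node 1 receives three messages (1, 4, and 8), so its sum is 13, its mean is 13/3, and its max is 8. Node 2 receives nothing and stays at 0 in all three cases.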

4. Update

This is the final step where you update the target node features using the aggregated messages. Depending on your architecture, you might do nothing here and simply return the aggregated messages, for example when you have added self-loops so that each node’s own features are already part of the aggregation. Alternatively, you can add the node’s own features to the aggregated messages or pass them through a neural network layer.

It takes inputs, which is the aggregated messages from the aggregate function and whatever you want from the propagate function.
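To tie the four steps together, here is a heavily simplified sketch of the control flow. toy_propagate is a hypothetical helper written for this post: it supports only sum aggregation and ignores many things the real MessagePassing class handles. It does show how the _i / _j suffixes in the message signature decide which nodes the data is gathered from.

```python
import inspect
import numpy as np

def toy_propagate(edge_index, message, update, num_nodes, **kwargs):
    """Hypothetical, simplified stand-in for propagate (sum aggregation only)."""
    source, target = edge_index
    args = {}
    for name in inspect.signature(message).parameters:
        if name.endswith("_j"):      # gather from source nodes
            args[name] = kwargs[name[:-2]][source]
        elif name.endswith("_i"):    # gather from target nodes
            args[name] = kwargs[name[:-2]][target]
        else:                        # pass through unchanged (e.g. norm)
            args[name] = kwargs[name]
    messages = message(**args)                      # 2. message
    out = np.zeros((num_nodes, messages.shape[1]))
    np.add.at(out, target, messages)                # 3. aggregate (sum)
    return update(out)                              # 4. update

edge_index = np.array([[0, 1, 1, 2],
                       [1, 0, 2, 1]])
x = np.array([[1.0], [10.0], [100.0]])
norm = np.array([0.5, 0.5, 0.5, 0.5])  # one made-up weight per edge

out = toy_propagate(
    edge_index,
    message=lambda x_j, norm: norm.reshape(-1, 1) * x_j,
    update=lambda aggr_out: aggr_out,
    num_nodes=3,
    x=x,
    norm=norm,
)

# Passing the features under another name changes the expected suffix names.
renamed = toy_propagate(
    edge_index,
    message=lambda feature_vec_j: feature_vec_j,
    update=lambda aggr_out: aggr_out,
    num_nodes=3,
    feature_vec=x,
)
print(out.ravel(), renamed.ravel())
```

The second call mirrors the naming rule from the Important box above: because the features were passed as feature_vec, the message function has to ask for feature_vec_j.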

Conclusion

In this blog post, we explored how message passing works in PyTorch Geometric by breaking down the key methods of the MessagePassing class: propagate, message, aggregate, and update. If you want to customize your GNN architecture and experiment with different message-passing schemes, understanding these methods is critical. With this knowledge, you can implement your own GNN layers and tailor them to your specific needs. I will drop a simple working code that I’ve borrowed from the PyG documentation below for reference.

from typing import Optional

import torch
from torch import Tensor
from torch.nn import Linear, Parameter
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree
from torch_geometric.data import Data

class GCNConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='add')  # "Add" aggregation (Step 5).
        self.lin = Linear(in_channels, out_channels, bias=False)
        self.bias = Parameter(torch.empty(out_channels))

        self.reset_parameters()

    def reset_parameters(self):
        self.lin.reset_parameters()
        self.bias.data.zero_()

    def forward(self, x, edge_index):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]
        print("Forward pass...")
        print(f"x shape: {x.shape}")
        print(f"edge_index shape: {edge_index.shape}")

        # Step 1: Add self-loops to the adjacency matrix.
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))

        # Step 2: Linearly transform node feature matrix.
        x = self.lin(x)

        # Step 3: Compute normalization.
        source, target = edge_index
        deg = degree(target, x.size(0), dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0
        norm = deg_inv_sqrt[source] * deg_inv_sqrt[target]

        # Step 4-5: Start propagating messages.
        out = self.propagate(edge_index, x=x, norm=norm)

        # Step 6: Apply a final bias vector.
        out = out + self.bias

        return out

    def message(self, x_i, x_j, norm, edge_index):
        # x_j has shape [E, out_channels]
        print("Creating messages...")
        print(f"x_i shape: {x_i.shape}")
        print(f"x_j shape: {x_j.shape}")
        print(f"norm shape: {norm.shape}")

        # Step 4: Normalize node features.
        return norm.view(-1, 1) * x_j

    def aggregate(
        self,
        inputs: Tensor,
        index: Tensor,
        ptr: Optional[Tensor] = None,
        dim_size: Optional[int] = None,
    ) -> Tensor:
        print("Aggregating messages...")
        print(f"Inputs shape: {inputs.shape}")
        print(f"Index shape: {index.shape}")
        print(index)
        return super().aggregate(inputs, index, ptr, dim_size)

    def update(self, inputs: Tensor) -> Tensor:
        print("Updating node embeddings...")
        print(f"Inputs shape: {inputs.shape}")
        print(inputs)
        return super().update(inputs)


edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [1, 2],
                           [2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index.t().contiguous())

conv = GCNConv(1, 2)
out = conv(data.x, data.edge_index)
print(out)
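The output of the layer depends on the randomly initialized linear weights, but the normalization from Steps 1 and 3 is deterministic. As a sanity check, this small NumPy sketch (independent of PyG) reproduces it for the same 3-node graph:

```python
import numpy as np

num_nodes = 3
edge_index = np.array([[0, 1, 1, 2],
                       [1, 0, 2, 1]])

# Step 1: append one self-loop per node.
loops = np.arange(num_nodes)
edge_index = np.concatenate([edge_index, np.stack([loops, loops])], axis=1)

# Step 3: in-degree of every node, then 1 / sqrt(deg_source * deg_target) per edge.
source, target = edge_index
deg = np.bincount(target, minlength=num_nodes).astype(float)
deg_inv_sqrt = deg ** -0.5
norm = deg_inv_sqrt[source] * deg_inv_sqrt[target]

print(deg)  # [2. 3. 2.]
print(np.round(norm, 3))
```

With self-loops, the degrees are 2, 3, and 2. Every edge between node 1 and an end node gets 1/sqrt(6) ≈ 0.408, the self-loops of nodes 0 and 2 get 0.5, and the self-loop of node 1 gets 1/3.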

Thanks for coming so far, have fun!

Academic Writing Notes: Paragraphs Development and Sentence Skills

Nusret Ozates — Sun, 12 Oct 2025 21:00:00 GMT

Do you remember the times you read an article or paper and couldn’t understand what the author wanted to say, even though you knew the underlying concepts? Even the most brilliant ideas can get lost in poor writing, and even the simplest ideas can become hard to understand. Choppy paragraphs, misplaced phrases, and grammatical run-ons can obscure your argument and frustrate your reader.

In this post, I will share what I learned from Koc University Academic Writing class videos and materials.

Topic and Stress

You can divide a sentence into two parts: the topic and the stress. The topic is what the sentence is about, and the stress is what you want to say about the topic or what is new information.

Topic part. Readers:
- Expect to understand what the sentence is about.
- Try to connect the sentence to what they have already read.

Stress part. Readers:
- Expect to see new and important ideas.
- Focus most of their interpretative effort.

Example:

Accounts of depression evolved after psychologists introduced the concepts of defeat and entrapment.

Known-New Contract

Introduce your readers to the “big picture” first by giving them information they already know (the topic part).
Then they can link what’s familiar to the new information you give them (the stress part).

As that new information becomes familiar, it too becomes old information that can link to newer information.

Example:

Accounts of depression evolved after psychologists introduced the concepts of defeat and entrapment. These concepts have been implicated in theoretical accounts of anxiety and suicide. Such theories..

Example 2:

However, managed WebRTC services using SFU architecture and SDN-assisted IP multicasting of scalable video within WebRTC system are discussed for the first time in this paper

Important

The sentence above starts with a lot of complex terms, and we don’t know why they matter until the end of the sentence, which is bad. The simple fix is just reversing the order!

However, this paper is the first to analyze managed WebRTC services using SFU architecture and SDN-assisted IP multicasting of scalable video within WebRTC system

What If I Want to Stress Multiple Ideas in a Sentence?

Try to introduce just one major idea per sentence, especially if the idea is complex.
If your text is complex and you have two ideas worth emphasizing, create two sentences.

Example:

However, it uses the already limited upload bandwidth of clients inefficiently and is not scalable with the number of clients, i.e., it becomes impractical as the number of endpoints grows bigger.

Instead of the sentence above, you can write:

However, mesh topology uses the already limited upload bandwidth of clients inefficiently. It is not scalable with the number of clients, i.e., it becomes impractical as the number of endpoints grows bigger.

Sometimes two sentences should be one if they refer to the same idea.

Example:

Such leaders should make the work of their followers more pleasant. Moreover, they should treat the followers as equals, and respect them.

Instead of the sentences above, you can write:

Such leaders should make the work of their followers more pleasant by treating them equally and respectfully.

Additional Steps to Edit Complex Writing

Move the subject and the verb close together.
Break apart sentences that contain too much new information.
Use transitional phrases to indicate relationships: moreover, in addition, consequently, therefore…

Run-Ons

A run-on is two complete thoughts run together with no sign to mark the break between them or with just a comma:

Then, in [2], they also presented a bisection algorithm to compute ε-pseudospectral abscissa of a fixed matrix, i.e. , and tried to compute minimum ε-pseudospectral abscissa over feasible matrices, however, an algorithm wasn’t presented yet.

Then, in [2], they also presented a bisection algorithm to compute ε-pseudospectral abscissa of a fixed matrix, i.e. . They also tried to compute minimum ε-pseudospectral abscissa over feasible matrices. However, an algorithm wasn’t presented yet.

Note

I personally didn’t like the second version either, because it has too many “they also” parts.

Fragments

A sentence fragment is a group of words that lacks a subject or a verb and does not express a complete thought:

Purdue offers many majors in engineering. Such as electrical, chemical, and industrial engineering.

Purdue offers many majors in engineering such as electrical, chemical, and industrial engineering.

Parallelism

Words in a pair or series should have a parallel structure.

Not Parallel: The production manager was asked to write his report quickly, accurately, and in a detailed manner.

Parallel: The production manager was asked to write his report quickly, accurately, and thoroughly.

Misplaced Modifiers

Misplaced modifiers do not describe the word in the way the writer intended because of their wrong place in a sentence.

George couldn’t drive to work in his small sports car with a broken leg.

With a broken leg, George couldn’t drive to work in his small sports car.

In this example, we and transformer models know that George has a broken leg, not the car. But grammatically, the modifier “with a broken leg” seems to describe the car. This is an easy example, but in an academic text, such errors can be more complex and harder to spot.

In order to avoid misplaced modifiers, place the words as close as possible to what they describe.

Dangling Modifiers

A modifier that opens a sentence must be followed immediately by the word it is meant to describe. Otherwise the sentence takes on an unintended meaning.

While smoking a pipe, my dog sat with me.

While smoking a pipe, I sat with my dog.

While I was smoking a pipe, my dog sat with me.

Again, this is an easy example, but in an academic text, such errors can be more complex and harder to spot.

Sentence Variety

Too many sentences with the same structure and length can grow monotonous for readers.
Varying sentence style and structure can also reduce repetition and add emphasis.
Long sentences work well for incorporating a lot of information, and short sentences can often maximize crucial points.

Overusing Long Sentences

Long sentences can be difficult to read and understand, especially if they contain multiple ideas or clauses. Breaking up long sentences into shorter ones can improve clarity and readability.

The company reported that yearly profit growth, which had steadily increased by more than 7% since 1989, had stabilized in 2009 with a 0% comp, and in 2010, the year they launched the OWN project, actually decreased from the previous year by 2%. This announcement stunned Wall Street analysts, but with the overall decrease in similar company profit growth worldwide, as reported by Author (Year) in his article detailing the company’s history, the company’s announcement aligns with industry trends and future industry predictions.

The company reported that profit growth stabilized in 2009, though it had steadily increased by more than 7% since 1989. In 2010, the year they launched the OWN project, company profit growth decreased from the previous year. This announcement stunned Wall Street analysts. According to Author (Year), however, this decrease is exemplar of a trend across similar company profit growth worldwide; it also supports future predictions for the industry.

Note

Notice that the sentence count only increased by two, but thanks to the choice of where each sentence begins and ends, the paragraph is easier to read. Moreover, the sentence variety has increased.

Short Sentences

Read the text below aloud:

Too many short sentences can hurt an essay. They can make the writing seem choppy. The writing may seem like it is below a college level. Readers may lose interest. They may not want to continue reading.

See the effect? Let’s fix it:

Too many short sentences can hurt an essay, for it can make the writing seem choppy and below a college level. Because of this, readers may lose interest and not want to continue reading.

Change the Rhythm!

Change the rhythm of your writing by varying sentence length and structure. As you will see in the example below, this can make your writing more interesting and engaging.

Vary the rhythm by alternating short and long sentences:

The Winslow family visited Canada and Alaska last summer to find some Native American art. In Anchorage stores they found some excellent examples of soapstone carvings. But they couldn’t find a dealer selling any of the woven wall hangings they wanted. They were very disappointed when they left Anchorage empty-handed.

The Winslow family visited Canada and Alaska last summer to find some Native American art, such as soapstone carvings and wall hangings. Anchorage stores had many soapstone items available. Still, they were disappointed to learn that wall hangings, which they had especially wanted, were difficult to find. Sadly, they left empty-handed.

I think I see something similar to this in the novels I’ve read.

Repeated Subjects or Topics

Handling the same topic for several sentences can lead to repetitive sentences. When that happens, consider using these parts of speech to fix the problem:

Relative pronouns

Indiana used to be mainly an agricultural state. It has recently attracted more industry.

Indiana, which used to be mainly an agricultural state, has recently attracted more industry.

Participles

Wei Xie was surprised to get a phone call from his sister. He was happy to hear her voice again.

Surprised to get a phone call from his sister, Wei Xie was happy to hear her voice again.

Prepositions

The university has been facing pressure to cut its budget. It has eliminated funding for important programs.

Under pressure to cut its budget, the university has eliminated funding for important programs.

Final Words

Finally, none of these tips are strict rules; they are guidelines to help you improve your writing. They are also very easy to forget. The best thing you can do is keep writing constantly, while editing your own writing with these guidelines in mind. Over time, you will internalize them and your writing will improve. I also believe that reading a lot of well-written articles and books will help you improve your writing skills.

References

Koc University Writing Center
Sentence Variety. (2018). Retrieved from https://owl.purdue.edu/owl/general_writing/academic_writing/sentence_variety/index.html
Making Complex Writing Intelligible with Known-New Contract. (2018). Global Communication Center, Carnegie Mellon University. Retrieved from https://www.cmu.edu/gcc/handouts/old-new-handout-pdf

Test-Time Compute, Reasoning and Human Brain

Nusret Ozates — Thu, 06 Feb 2025 21:00:00 GMT

I have lots of things to do, but I’ve suddenly been struck by inspiration, and since the well-known work-avoidance mechanism has kicked in, I’m going to write down my thoughts on test-time compute, shared decoders, and reasoning. Here we go, a theoretical and lengthy piece is coming.

One of the things I love about LLMs is that they handle multiple tasks with a single loss and a single branch. That is, instead of a shared encoder + separate decoders for each task like in U-net models, there’s just a transformer decoder. As far as I know, this comes from T5 models where we model all NLP tasks as text-to-text. We’ll get back to LLMs in a bit, but let’s take a look at the vision side for now.

I don’t think there’s an equivalent of this in vision. For example, how would you combine segmentation and classification tasks? At the very least, the last layers would have to be different. However, there’s a model that comes very, very close to combining these, and its developers were inspired by the human brain. The model is called BU-TD. Let’s start with the inspiration part. These folks say that ‘segmenting anything and everything at once’ is not the right approach; the human brain doesn’t work that way.

For example, the longer you look at the image above, the more details emerge; the brain doesn’t grasp the entire image with all its details at once. So why are we trying to make models do this? This is where BU-TD comes into play.

Again, in this model inspired by the human brain, an encoder first processes the image. The decoder then takes the vector coming from the encoder and additionally receives a task vector and an argument vector.

For example, we tell the model to look at the image with task: find hairstyle and argument: Bob. Afterwards, the decoder can optionally output the segmentation of Bob’s hair, but it may not; that part is a bit vague. By feeding the decoder’s result back into the encoder, we get the result corresponding to that style, and so on.

The nice thing about this system is that, first, it uses the weights in the decoder to the fullest, making it a parameter-efficient model. Second, since each task, argument, etc. is decoupled, the system learns concepts better. For example, suppose our friend Bob is always bald in the training data, so Bob’s hairstyle value is always “bald” during training. However, since the model learns what short hair and long hair are from other examples independently of the person, it can still detect the new style at test time if Bob comes to Turkey and gets hair implants. Thanks to this, the model learns much better with much less data.

Now we come back to LLMs. Since the T5 models, LLMs have effectively applied the TD part of this model, that is, a single decoder shared across all tasks. This is especially visible in PrefixLM models: when the image tokens and (I think) the question are processed together with self-attention, the TD logic emerges.

The second part is that when people look at a picture, we extract the details, relationships, etc. over time; all the details don’t come at a single glance, right? Well, this part actually corresponds to the concept we call test-time compute. For example, as we look at the image on the left below, the details on the right emerge.

Based on what we’ve learned up to this point, trying to process everything at once is not logical. Learning the relationships one by one during training and processing the image over time using more compute/time during testing is an effective solution in terms of both the number of parameters and learning more with less data. Our next problem is this: We don’t want to process the image all at once, okay, but we also don’t want to process the entire image; we want to process as much as necessary for the information we are interested in to save time and money. This is where reasoning comes into play. What does a good LLM model do in terms of reasoning? It divides the question we ask into the necessary parts and solves the parts step by step, and stops when it reaches the result. By doing this, we have the following system:

I’m saying that if I were shown a picture and asked, ‘What is the size of the bag of the woman holding the bag that the girl is looking at?’, my brain would process the image in a similar way and wouldn’t look for more details.

I think the BU-TD model itself is very limited in terms of input and output ranges, but the main idea is still strong, and I believe VLMs are very close to bringing this idea into AI models.

That’s it, I couldn’t come to a conclusion with the text. These things suddenly came to my mind while reading a very unrelated article, I thought I’d write them down. Good luck and thanks for reading up to this point!

BERT: Encoder Stack Is All You Need

Nusret Ozates — Wed, 23 Jun 2021 21:00:00 GMT

In this article, I will talk about the BERT model and add sources for learning more about it. To understand how BERT works, you need to understand how transformer models work. You can read my article about Transformers or you can learn it from here.

BERT [1] (Bidirectional Encoder Representations from Transformers) is a new language representation model. It is one of the ways of creating word embeddings and sentence representation vectors. BERT uses the encoder stack of the transformer model to output the representation of each token in the given input. Additionally, it has a special token “[CLS]” at the beginning of the inputs to use in classification tasks. BERT has two steps:

Pre-training
Fine-tuning

The pre-training step is where the model is trained to learn the given language(s) and output a meaningful representation of the given input sequence (the term “sequence” is used here to emphasize that the input can be more than one sentence). In this step, the model is trained on unlabeled text data using two different tasks. Fine-tuning needs labeled data: the model is fine-tuned to perform a given downstream task such as classification or question answering.

Pre-training

Pre-training is done with two different tasks:

Masked Language Model (MLM)
Next Sentence Prediction (NSP)

Masked Language Model

According to the authors of the BERT model, deep bidirectional models are more powerful than unidirectional models. The problem with bidirectional conditioning is that it allows each word to indirectly “see itself”, so the model can trivially predict the target word. To solve this problem, the authors randomly mask some percentage (15% in the paper) of the input tokens by replacing them with the special token “[MASK]”, and the model is tasked with predicting the masked tokens. The authors call this the “Masked Language Model”; in the literature it is known as a “cloze test” (fill-in-the-blanks questions). The final hidden vectors at the masked positions only are fed into a softmax layer whose size equals the vocabulary size to predict the original words.

The “[MASK]” token is used during pre-training but never appears during fine-tuning. This causes a mismatch, and to mitigate it, the authors do not always use the “[MASK]” token. A selected token at position i is replaced with the “[MASK]” token with 80% probability, replaced by a random token with 10% probability, and left unchanged with 10% probability. The final hidden vector at position i is then used to predict the original token.

An example of the masked language model from Jay Alammar [2]
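The selection and 80/10/10 replacement rule described above can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: the function name and the tiny vocabulary are made up, it works on whole words instead of WordPiece tokens, and it does not protect special tokens such as “[CLS]”.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels is None where no prediction is required."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(token)   # not selected: left alone, no loss here
            labels.append(None)
            continue
        labels.append(token)          # selected: the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(token)               # 10%: keep the original token
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Only the positions where labels is not None contribute to the loss, which matches the point above that the softmax is applied only at the selected positions.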

Next Sentence Prediction

Some NLP tasks, such as Question Answering and Natural Language Inference (determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”), require understanding the relationship between sentences. This relationship cannot be captured by the masked language model task. To train a model that can capture it, the authors give two sentences, “Sentence A” and “Sentence B”, as input. With 50% probability, “Sentence B” is the actual sentence that follows “Sentence A”, and with 50% probability it is a random sentence. As in the example above, the output at the special “[CLS]” token is used to predict whether the second sentence follows the first.

An example of the next sentence prediction model from Jay Alammar [2]
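The 50/50 pair construction can be sketched in plain Python as well. The helper below is hypothetical and simplified: it samples the random “Sentence B” from the same short list of sentences, whereas a real pre-training pipeline would draw it from a large corpus. It needs at least three sentences so that a negative candidate always exists.

```python
import random

def make_nsp_pairs(sentences, rng=None):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered list of sentences."""
    rng = rng or random.Random()
    pairs = []
    for idx in range(len(sentences) - 1):
        sentence_a = sentences[idx]
        if rng.random() < 0.5:
            pairs.append((sentence_a, sentences[idx + 1], True))   # real next sentence
        else:
            # Random sentence that is neither sentence_a nor its true successor.
            candidates = [s for k, s in enumerate(sentences) if k not in (idx, idx + 1)]
            pairs.append((sentence_a, rng.choice(candidates), False))
    return pairs

sentences = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live in the southern hemisphere.",
]
for sentence_a, sentence_b, is_next in make_nsp_pairs(sentences, rng=random.Random(0)):
    print(f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP] -> is_next={is_next}")
```

Each printed line shows the two-sentence input format, and the is_next flag is the label that would be predicted from the “[CLS]” position.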

Fine-tuning

Fine-tuning is a very simple step: swap the input depending on the task (a single sentence or two sentences) and connect the output to an appropriate classification layer. For a sentence classification task, the output vector of the model’s “[CLS]” token can be connected to a softmax layer, and then you can either:

Freeze the BERT weights and train only the weights of the final classification layer
Fine-tune all weights

Code Example

Using TensorFlow Hub, training or fine-tuning BERT models is very easy. In the following steps, I will show you how you can use a BERT model to detect toxicity in texts for the Toxic Comment Classification Challenge. Download the training data first, create a TensorFlow dataset, and split it into training and validation sets using this code:

import tensorflow as tf

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def load_datasets(batch_size=32, validation_size=10000):
    # Read only the text column and the six label columns from the Kaggle CSV.
    dataset = tf.data.experimental.make_csv_dataset(
        'data/kaggle_toxic_comments/train.csv', batch_size=batch_size, num_epochs=1,
        select_columns=['comment_text'] + LABELS)

    # Work on single rows so that take/skip count examples, not batches.
    dataset = dataset.unbatch()
    validation = dataset.take(validation_size)
    train = dataset.skip(validation_size)

    def to_features_and_labels(data):
        # dataset_preprocessing is my own preprocessing helper (not shown here).
        return (dataset_preprocessing(data['comment_text']),
                [data[name] for name in LABELS])

    train = train.map(to_features_and_labels).batch(batch_size).cache().prefetch(tf.data.AUTOTUNE)
    validation = validation.map(to_features_and_labels).batch(batch_size).cache().prefetch(tf.data.AUTOTUNE)

    return train, validation

Note from 2026

The code above is outdated; today you would likely use the Hugging Face Transformers library for the same task.

In the code above, I did some preprocessing, but you can just use data["comment_text"] directly.

Let’s create the BERT model using the Tensorflow Hub:

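import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the custom ops used by the preprocessing model
from tensorflow.keras.layers import Dense

# Raw strings go in; the preprocessor turns them into BERT's input tensors.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder_inputs = preprocessor(text_input)
# trainable=False freezes the BERT weights, so only the new head is trained.
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4",
    trainable=False)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]  # [batch_size, 1024]
sequence_output = outputs["sequence_output"][:, 0, :]  # [CLS] vector, [batch_size, 1024]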

Now we have two different outputs: the pooled output and the sequence output. The pooled output represents each input sequence as a whole, and the sequence output represents each input token in context. Either of those can be used as input to further model building. For this task I want to use the output of the [CLS] token, so I take the first position of the sequence output (sequence_output above), connect it to a sigmoid layer, and create the model like this:

classification_output = Dense(6, activation='sigmoid')(sequence_output)

embedding_model = tf.keras.Model(text_input, classification_output)

Now compile the model and train it:

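# Six independent yes/no labels, so binary cross-entropy fits the sigmoid outputs.
embedding_model.compile(
    optimizer=tf.keras.optimizers.Nadam(learning_rate=0.025),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC()],
    run_eagerly=False
)

embedding_model.summary()

# train_data and test_data are the training and validation datasets created earlier.
embedding_model.fit(x=train_data, validation_data=test_data, epochs=2)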

Thanks for reading!

References

https://arxiv.org/pdf/1810.04805.pdf - The BERT paper.
http://jalammar.github.io/illustrated-bert/ - Awesome and more detailed explanation of BERT model
https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4 - An example BERT model from Tensorflow Hub

Transformer Architecture: How Transformer Models Work?

Nusret Ozates — Fri, 19 Feb 2021 21:00:00 GMT

Before Transformers, RNNs with attention mechanisms were the state-of-the-art approach to language modeling and neural machine translation. But RNNs have a critical problem: their sequential structure does not allow parallel computation across time steps. There were some optimizations to make them faster, but the main problem remained. The Transformer model removed RNNs completely and built the whole architecture on the attention mechanism.

Like most neural machine translation models, the Transformer has an encoder-decoder structure. It uses stacked encoders and decoders that contain attention layers and feed-forward neural networks.

Encoder and Decoder Stacks

Encoder

Each encoder layer has two sub-layers: a “multi-head attention layer” (explained in the following sections) and a feed-forward neural network. There is a residual connection around each of the two sub-layers, followed by a normalization layer. Every encoder layer has the same structure, but the layers do not share weights. To make the residual connections possible, all sub-layers and the embedding layers produce outputs of the same dimension. We denote this dimension as $d_{model}$. In the original article, $d_{model} = 512$. The first encoder’s input is the embedding vectors of the source sentence with position information injected (explained in the Positional Encoding section). Each of the other encoders’ inputs is the output of the encoder below it.

In summary: An encoder receives input as a list of vectors. Then it processes these vectors by passing them to a self-attention layer, then sends them to a Feedforward Neural Network. Finally, the output goes to the next encoder. The last encoder sends its output to every decoder. The number of Encoders is a hyperparameter.

The word at each position goes through a self-attention process. Then, each result passes through a feed-forward neural network (the same network, but each vector passes through it separately). In this example there are two words, but the maximum number of words that can be given to the model is a hyperparameter.

Decoder

The structure of the decoder is nearly the same as the encoder. The differences are that each decoder layer has one more sub-layer, which attends over the output of the encoder stack, and that its self-attention sub-layer is a “masked” multi-head attention layer. When predicting position “i”, we need to be sure that we only attend to the known outputs at positions < “i”. The Transformer model is auto-regressive: it makes predictions one part at a time and uses its output so far to decide what to do next.

During training, we use teacher forcing. Teacher forcing means passing the true output to the next time step regardless of what the model predicts at the current time step. As the Transformer predicts each word, self-attention allows it to look at the previous words in the decoder’s input sequence to better predict the next word. To prevent the model from peeking at the expected output, the model uses a look-ahead mask. This is what we call a masked multi-head attention layer. For a target sentence with 4 words, a look-ahead mask looks like this:

[[0, 1, 1, 1],
 [0, 0, 1, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 0]]

We use this matrix as follows:

scaled_attention_logits -= mask * 1e9

Here, scaled_attention_logits is the scaled result of the matrix multiplication of the queries and keys. We add -1e9 to the logits that need to be masked, so after the softmax the attention weights for positions from the “future” are practically zero. With that, values from the future have no impact when calculating the attention output.
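Here is a small plain-Python sketch of that masking step, so you can see the effect on the attention weights. The logits are made-up numbers and the helper names (`look_ahead_mask`, `masked_attention_weights`) are mine; a real implementation would do the same thing on tensors.

```python
import math


def look_ahead_mask(size):
    """mask[i][j] == 1 when position j is in the future of position i."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scaled_logits, mask):
    """Subtract mask * 1e9 from the logits, then apply softmax row by row."""
    masked = [[v - m * 1e9 for v, m in zip(row, mask_row)]
              for row, mask_row in zip(scaled_logits, mask)]
    return [softmax(row) for row in masked]


logits = [[0.5, 1.0, 2.0, 0.1]] * 4  # made-up scaled attention logits
weights = masked_attention_weights(logits, look_ahead_mask(4))
for row in weights:
    print([round(w, 3) for w in row])
```

In every row of the result, the entries to the right of the diagonal are zero, so the word at position i only attends to itself and to earlier positions.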

Important

Why are we using teacher forcing and giving the true output to the model?

Short answer: Otherwise we would have to run the decoder a number of times equal to the number of words in the target sentence.

Longer answer: The secret lies in the implementation. Let’s say the true output is a three-word sentence, the embedding dimension is five, and our vocabulary size is twenty. Lastly, our batch size is 1 for ease of explanation. After passing these words through the embedding layer, we have a [1, 3, 5]-dimensional tensor. After all of the decoders’ calculations, we have a tensor with the same dimensions. As we can see in the Transformer model architecture, we pass this tensor to a final linear/dense layer. The number of neurons in this layer is twenty because that is our vocabulary size. That means we are multiplying a [1, 3, 5]-dimensional tensor by a [5, 20]-dimensional matrix. The result is a [1, 3, 20]-dimensional tensor. In other words, every position predicts its next word in a single run of the decoder. How do we calculate the loss? With sparse categorical cross-entropy. The text below is from the documentation of the SparseCategoricalCrossentropy class in TensorFlow:

Use this crossentropy loss function when there are two or more label classes.
We expect labels to be provided as integers. There should be ‘# classes’ floating point values per feature for ‘y_pred’ and a single floating point value per feature for ‘y_true’.
Example:

y_true = [1, 2]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
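The shape walk-through and the loss can be checked with a few lines of plain Python. This is only a sketch: the batch dimension is dropped (batch size 1), the weights are random placeholders, and the target token ids are made up.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean of -log(probability assigned to the true class) over all positions."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


rng = random.Random(0)
seq_len, d_model, vocab_size = 3, 5, 20
decoder_output = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]
final_dense = [[rng.uniform(-1, 1) for _ in range(vocab_size)] for _ in range(d_model)]

logits = matmul(decoder_output, final_dense)  # shape [3, 20]: one row per target position
probs = [softmax(row) for row in logits]      # a distribution over the vocabulary per position
target_ids = [4, 17, 9]                       # made-up ids of the true next words
loss = sparse_categorical_crossentropy(target_ids, probs)
```

All three positions get their prediction and their loss term from one decoder pass, which is exactly what teacher forcing buys us. Running `sparse_categorical_crossentropy` on the y_true and y_pred from the documentation example above gives about 1.177.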

The Transformer model architecture

Self-Attention

Let’s suppose we want to translate the following sentence:

“The books were on the shelves because they are old.”

What does “they” in this sentence refer to? Is it referring to the books or the shelves? It’s a simple question for a human, but not as simple for an algorithm. Self-attention allows the model to associate “they” with “books”.

An attention mechanism maps a query and a set of key-value pairs to an output. All of them are vectors. The output is calculated as a weighted sum of the values. The weights are computed by a function that takes the query and the keys as input. These weights are not trainable parameters themselves; they are computed from the queries and keys, which in turn are produced by trainable, learned weight matrices.

Query, Key, and Value vectors are created from the layer’s input vectors (e.g., the embeddings of the input words for the first encoder layer) by multiplying these input vectors by three different weight matrices, $W^Q$, $W^K$, and $W^V$, that are learned during training. Every layer (both encoder and decoder layers) in the Transformer model has its own $W^Q$, $W^K$, and $W^V$ matrices.

Calculation of Query, Key and Value vectors

How Do We Use Query, Key, and Value to Calculate Attention?

Let’s suppose we are calculating the self-attention for the word “better”. We will use “Scaled Dot-Product” attention.

We compute dot products of the query with all keys
The results will be divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors. This is called scaling.
Lastly, we apply the softmax function to the result from step two.

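This tells us how much attention we need to give to each word when encoding “better”. The output is then the sum of the value vectors weighted by these scores. For faster processing, the queries are packed into the matrix Q, the keys into the matrix K, and the values into the matrix V. The general scaled dot-product attention formula is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$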

Scaled Dot-Product Attention
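The three steps above, plus the final weighted sum of the values, fit in a short plain-Python function. This is a sketch with nested lists instead of tensors; the function name and the toy numbers are my own.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """queries: [n_q, d_k], keys: [n_k, d_k], values: [n_k, d_v] as nested lists."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Steps 1 and 2: dot product with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Step 3: softmax turns the scores into attention weights.
        weights = softmax(scores)
        # Output: sum of the value vectors weighted by the attention weights.
        output = [sum(w * v[c] for w, v in zip(weights, values))
                  for c in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: the first key matches the query better, so it gets more weight.
outputs, weights = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The scaling matters for large $d_k$: without it the dot products grow large and the softmax saturates, which makes the gradients very small.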

Multi-Head Attention

The idea/question behind multi-head self-attention is: “How do we improve the model’s ability to focus on different features of the input sentence?”.

With single-head self-attention, the encoding of a word could be dominated by the word itself.
Multi-head attention allows the model to learn different kinds of relationships. E.g., one head for grammar, one for vocabulary, etc.

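For example, if we have 8 heads (denoted h = 8), there will be 8 different Q, K, and V matrices and 8 different outputs. For each head i, the weights $W_i^Q$, $W_i^K$, and $W_i^V$ will have dimensions:

$$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{model} \times d_v}$$

and the outputs will be concatenated and multiplied by another weight matrix $W^O$ with dimension $hd_v \times d_{model}$ to get the final output. In the original article, $d_k = d_v = d_{model}/h = 64$.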

Multi-Head Attention: We embed each word, create 8 “attention heads”, multiply X by the weight matrices to get 8 different Key, Query, and Value matrices, calculate the attention output of each head using these matrices, concatenate the resulting matrices, and multiply them by the weight matrix $W^O$ to get the output. This output is the output of the Multi-Head Attention layer, and we pass it to the “Add and Normalize” layer.
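The whole multi-head pipeline from the caption above can be sketched in plain Python. The sizes are deliberately tiny (16 instead of 512 for the model dimension), the weights are random placeholders, and all names are mine. I assume the per-head dimension is the model dimension divided by the number of heads, as in the original article.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, heads, w_o):
    """x: [n, d_model]; heads: list of (W_q, W_k, W_v); w_o: [h * d_v, d_model]."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate the h head outputs along the feature axis: [n, h * d_v].
    concatenated = [[value for head in head_outputs for value in head[i]]
                    for i in range(len(x))]
    # Project back to d_model with the output weight matrix.
    return matmul(concatenated, w_o)


rng = random.Random(0)
n_words, d_model, h = 2, 16, 8  # small made-up sizes
d_k = d_v = d_model // h
x = random_matrix(n_words, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_v, d_model, rng)
output = multi_head_attention(x, heads, w_o)  # shape [n_words, d_model]
```

Because each head works on a smaller dimension, the total cost is similar to one full-size attention head, and the output has the same shape as the input, which is what the residual connection needs.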

Positional Encoding

As the Transformer model doesn’t use recurrence, there isn’t any position information (time step t) in the model input. Without it, the input is just a bag of words. To provide the model with this important information, we inject (add) an information vector (the positional encoding) to each embedding. These positional encoding vectors have the same dimension as the embeddings.

One way to add this information is to have a second embedding: as we have an embedding vector for “computer”, why shouldn’t we have an embedding for position 3? With this approach, the model can learn positional embeddings during training. The problem with this approach is the following: say we have 1000 training examples and the maximum input length is 50. If 800 of these training examples have a length of 30, the embedding vectors for the last 20 positions will be poorly trained and may not generalize well in practice.

As an alternative, we can use a static function that takes an integer input and gives a vector in a way that captures the relationship between positions. This function should reflect that position 8 is more related to position 9 than position

In the Transformer model, the authors decided to use sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here, pos is the position of the token, and i is the dimension. Each dimension of these positional encodings corresponds to a sinusoid (a curve similar to the sine function but possibly shifted in phase, period, amplitude, or any combination thereof). The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. The authors chose this function because they hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
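A minimal plain-Python sketch of the sinusoidal encoding from “Attention Is All You Need”: even dimensions use the sine and odd dimensions use the cosine of pos / 10000^(2i / d_model). The function name and the table sizes are my own choices for illustration.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a [max_len, d_model] table; even dims use sine, odd dims use cosine."""
    table = []
    for pos in range(max_len):
        row = []
        for dim in range(d_model):
            i = dim // 2  # index of the sine/cosine pair this dimension belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(max_len=50, d_model=8)

# The "linear function" claim for the first sine/cosine pair (frequency 1):
# sin(pos + k) = sin(pos) * cos(k) + cos(pos) * sin(k)
pos, k = 7, 5
shifted = pe[pos][0] * math.cos(k) + pe[pos][1] * math.sin(k)
```

Here `shifted` equals `pe[pos + k][0]` up to floating point error, and the coefficients cos(k) and sin(k) depend only on the offset k, not on the position itself.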

That’s all I know :) Thank you for reading!

References

http://jalammar.github.io/illustrated-transformer/ - An excellent overview of Transformer models.
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ - A detailed explanation of positional encodings.
https://arxiv.org/pdf/1409.0473.pdf - Neural Machine Translation by Jointly Learning to Align and Translate.
https://arxiv.org/abs/1706.03762v5 - Attention Is All You Need.

Attention mechanism and how it works on neural machine translation

Nusret Ozates and Hasan Kemik — Sat, 12 Dec 2020 21:00:00 GMT

In neural machine translation, we aim to find a sentence y that maximizes the probability of y given the source sentence x. We basically find $\arg\max_y p(y \mid x)$. Before Transformers, we used RNNs with an encoder-decoder approach. An encoder reads the input and returns a fixed-length vector to the decoder. The decoder uses this vector as its starting hidden state and outputs a translation from that encoded vector.

RNN Encoder-Decoder

Let’s look at the encoder-decoder architecture more formally.

Encoder

The encoder reads an input sequence (a sequence of vectors) $x = (x_1, \dots, x_{T_x})$ and “encodes” this sequence into a vector c. To calculate c, we calculate the hidden states with:

$$h_t = f(x_t, h_{t-1}) \tag{1}$$

and vector c is equal to:

$$c = q(\{h_1, \dots, h_{T_x}\})$$

$h_t$ is the hidden state at time t, c is a vector generated from the sequence of hidden states, and f and q are some non-linear functions. As an example, f can be an LSTM and $q(\{h_1, \dots, h_{T_x}\})$ can be equal to $h_{T_x}$, the last hidden state of the encoder.

Decoder

Note

In the article, the encoder’s hidden states are denoted by “h”, and the authors use “s” for the decoder’s hidden states.

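The decoder predicts the next word $y_t$, given the context vector c and all the previously predicted words $\{y_1, \dots, y_{t-1}\}$. Finally, the probability of the translated sentence y is calculated with:

$$p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$$

How do we calculate $p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$?

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$

where g is a non-linear function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN.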

Attention Mechanism: A new approach to Encoder-Decoder Structure

Encoder

In the encoder, the hidden states are calculated as in Eq. 1. However, in order to understand the context not only from the previous words but also from the following words, the sentence is traversed twice: once from beginning to end and once from end to beginning. For this operation, bidirectional RNNs (BiRNNs) are used. This produces two different hidden state vectors for each word. By concatenating these two vectors, we get our new hidden state vector h, which summarizes both the preceding and the following words.

Decoder

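In the decoder, the probability of each word is computed from the previous word’s vector $y_{i-1}$, the hidden state $s_i$, and the context vector $c_i$:

$$p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$$

The hidden state $s_i$ is computed from the previous hidden state $s_{i-1}$, the previous word’s vector $y_{i-1}$, and the current context vector $c_i$:

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

The context vector $c_i$ depends on all encoder hidden states $h_j$ and their weights. These weights represent the amount of “attention” that needs to be given to each hidden state to predict the next word $y_i$:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

where $\alpha_{ij}$ is the weight for hidden state $h_j$ when predicting word $y_i$. The “attention amount” is calculated by the formula:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)$$

$e_{ij}$ represents the “energy”, the importance of hidden state $h_j$ with respect to the previous decoder hidden state $s_{i-1}$. It is calculated by concatenating $s_{i-1}$ and $h_j$ and feeding the result through a feedforward neural network a, which allows the alignment model to be trained jointly through backpropagation.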
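To see how the energies, the attention weights, and the context vector fit together, here is a plain-Python sketch for a single decoding step. The alignment network is assumed to be one tanh layer on the concatenated vectors followed by a dot product with a vector `v`; the sizes and the random weights are placeholders of mine, not values from the article.

```python
import math
import random


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_score(s_prev, h_j, w, v):
    """Energy for one encoder state: a small feedforward net on concat(s_prev, h_j)."""
    concat = s_prev + h_j
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w_row, concat))) for w_row in w]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def context_vector(s_prev, encoder_states, w, v):
    """Return the context vector and the attention weights for one decoding step."""
    energies = [alignment_score(s_prev, h_j, w, v) for h_j in encoder_states]
    alphas = softmax(energies)  # attention weights, they sum to 1
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states)) for d in range(dim)]
    return context, alphas


rng = random.Random(0)
decoder_dim, encoder_dim, hidden_dim, source_len = 3, 4, 5, 6
s_prev = [rng.uniform(-1, 1) for _ in range(decoder_dim)]
encoder_states = [[rng.uniform(-1, 1) for _ in range(encoder_dim)] for _ in range(source_len)]
w = [[rng.uniform(-1, 1) for _ in range(decoder_dim + encoder_dim)] for _ in range(hidden_dim)]
v = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

context, alphas = context_vector(s_prev, encoder_states, w, v)
```

The context vector is recomputed at every decoding step with the current decoder state, so each output word gets its own weighting of the source words instead of one fixed-length summary.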

And with this approach, a new era of NLP has started. This attention mechanism became a basic building block of the famous Transformer models.

References

Neural Machine Translation by Jointly Learning to Align and Translate

Things you need to know about Docker to get started

Nusret Ozates — Sat, 07 Nov 2020 21:00:00 GMT

Some useful commands and concepts to use Docker!

The original video that I took notes from:

In this article, I’ll talk about Docker. We will begin with why we need it and end with how to manage multiple Docker containers at the same time.

Why do we need Docker?

We have web servers, database services, messaging services, etc., and all of them have their own dependencies (libraries, OS version, etc.), which can conflict with each other. We call this “The matrix from Hell”.

The matrix from Hell

What does docker do?

Docker runs each component in a separate, isolated environment with its own dependencies and libraries, all within the same VM or host.

What are the differences with VM?

VMs provide complete isolation! Each has its own (virtual) hardware, kernel, and OS. Docker containers, on the other hand, share the host’s hardware and its Linux kernel.

That is the reason why you can’t run a Windows container on a Linux host. You can say: “Hey! I have Docker on Windows!”. Then I say, look for WSL. 😄
Containers are meant to run a specific task or process, not to host an OS.

Virtual Machines vs Containers

Some Useful Docker Commands

docker version: It gives the docker version.
docker run: It is used to run a container from an image
- docker run nginx ⇒ Runs instance of the Nginx application on the docker host
- docker run -d nginx ⇒ Runs in the detached mode. That means the container will run in the background, and you can continue to use the terminal
- docker run --name webapp nginx ⇒ Runs a container with the given name
- docker run -it nginx ⇒ “-i” keeps stdin open so the container can read input from your terminal. “-t” allocates a pseudo-terminal so your dockerized app behaves as in a normal terminal session
- docker run -v /opt/datadir:/var/lib/mysql ….. ⇒ Mounts /opt/datadir (on your PC) at /var/lib/mysql (in the container). Your data will persist even when you delete the container.
- docker run -p 80:5000 nginx ⇒ Forwards port 80 of the host to port 5000 of the container.

Note: You can’t bind the same host port to multiple Docker containers.

docker ps: List all running containers and several key information about them. If used with the “-a” parameter, you can see previously stopped or exited containers.
docker stop: It stops the running containers. Needs container ID or name.
- docker stop silly_sammet
docker rm: Removes stopped or exited container permanently. If it prints the name back, we are good.
- docker rm silly_sammet
docker images: Gives a list of downloaded images and their sizes.
docker rmi: Removes the given image. You need to remove all dependent containers before.
- docker rmi nginx
docker pull: Just downloads the images so you won’t wait when you want to run the image.
docker exec: Execute a command in the container.
- docker exec distracted_meclintock cat /etc/hosts ⇒ Runs the command “cat /etc/hosts” in the container named distracted_meclintock
docker inspect: It returns all details of the container in JSON format.
- docker inspect webapp
docker logs: This shows the logs of a container. It is useful when your container runs in detached mode

What is this Dockerfile?

Dockerfile is a text file written in a specific format that docker can understand.

How can I export/import my docker image as a tar file?

You can export your Docker Image as a .tar file with this command:

docker save --output chatbot.tar nusret/chatbot

And you can easily import it with a very similar command.

docker load --input chatbot.tar

ENTRYPOINT VS CMD

Let’s say we have a docker container that just “sleeps” named “sleeper”. The docker file would be like this:

FROM ubuntu
CMD ["sleep","5"]

When I run the command:

docker run sleeper sleep 10

The CMD instruction will be replaced with “sleep 10”. But as this is a sleeper container, I would like to pass only “10” and have the container sleep for that long. To do this, we change the Dockerfile like this:

FROM ubuntu
ENTRYPOINT ["sleep"]

This time when I run:

docker run sleeper 10

The “10” will be appended to the “sleep” command and I can just set the sleep time. But what if I don’t write any number? How can I add a default sleep time?

FROM ubuntu
ENTRYPOINT ["sleep"]
CMD ["5"]
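The way ENTRYPOINT, CMD, and the arguments of docker run combine can be modeled with a tiny Python function. This is just a mental model for the exec (JSON array) form used in this section; `effective_command` is a name I made up, and it ignores extras such as overriding the entrypoint from the command line.

```python
def effective_command(entrypoint, cmd, run_args):
    """Return the argv the container starts with (exec-form instructions only).

    entrypoint, cmd: lists from the Dockerfile (use [] when not set).
    run_args: extra arguments given after the image name in `docker run`.
    """
    # Arguments on the `docker run` command line replace CMD entirely.
    trailing = run_args if run_args else cmd
    # ENTRYPOINT, when present, is always kept and the rest is appended to it.
    return entrypoint + trailing


# CMD only: `docker run sleeper sleep 10` replaces the whole command.
print(effective_command([], ["sleep", "5"], ["sleep", "10"]))
# ENTRYPOINT only: `docker run sleeper 10` appends the argument.
print(effective_command(["sleep"], [], ["10"]))
# ENTRYPOINT + CMD: `docker run sleeper` falls back to the default from CMD.
print(effective_command(["sleep"], ["5"], []))
```

So with the last Dockerfile, CMD only supplies the default argument: running the container without arguments sleeps for 5 seconds, and running it with “10” sleeps for 10.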

Docker Networking

Bridge: The default network a container gets attached to. A bridge network is a private internal network created by Docker on the host. All containers are attached to this network by default and get their own internal IP addresses. Containers can access each other using these internal IPs. If you want to access any of these containers from the outside world, you need to bind/forward a host port to the container’s port on the Docker network.
Host: Another way to access these containers is to remove the network isolation between the Docker host and the container by attaching the container to the host’s network.
None: In the “none” network, the container is not attached to any network, and it is not accessible from external networks or from any other Docker containers.

On user-defined networks, containers can also access each other using their names. Docker runs an embedded DNS server that resolves container names for them.

Docker Compose

When we have a complex app that runs with multiple containers, we need to write lots of run commands! But we have docker-compose.

With the latest command at the bottom, you can run all of these images and more! We are using a .yaml file to configure docker-compose.

Let’s say we have a sample application like this:

What you would do without docker-compose:

docker run -d --name=redis redis 
docker run -d --name=db postgres:9.4 
docker run -d --name=vote -p 5000:80 --link redis:redis voting-app 
docker run -d --name=result -p 5001:80 --link db:db result-app 
docker run -d --name=worker --link db:db --link redis:redis worker

With docker-compose, you can write a docker-compose.yaml file like this:

redis:
  image: redis
db:
  image: postgres:9.4
vote:
  image: voting-app
  ports:
    - 5000:80
result:
  image: result-app
  ports:
    - 5001:80
worker:
  image: worker

And run them all with a single command:

docker-compose up

What if some of the images are not already built or not in the DockerHub? Like the “voting-app” in our example, we can change the image key with a build key and specify a location of a directory that contains the application code and Dockerfile.

Change this code:

vote:
  image: voting-app
  ports:
    - 5000:80

To this:

vote:
  build: ./vote
  ports:
    - 5000:80

Docker Compose Versions

Docker-compose evolved over time and now supports a lot more options than it did in the beginning.

Version 1

redis:
  image: redis
db:
  image: postgres:9.4
vote:
  image: voting-app
  ports:
    - 5000:80
  links:
    - redis

It has several limitations. For example, if you wanted to deploy containers on a different network other than the default bridge network, there was no way of specifying that in this version of the file. Also, say you have a startup dependency or startup order of some kind. For example, your database container must come up first, and only then should the voting app be started. There was no way you could specify that in this version.

Version 2

version: "2"
services:
  redis:
    image: redis
  db:
    image: postgres:9.4
  vote:
    image: voting-app
    ports:
      - 5000:80
    depends_on:
      - redis

Support for these came in version 2. With version 2 and up, the format of the file also changed. You no longer specify your stack information directly as you did before. It is all encapsulated in the services section.

Another difference is with networking. With version 2, docker-compose automatically creates a dedicated bridged network for this application and then attaches all containers to that new network. All containers are then able to communicate with each other using each other’s service name. So you don’t need to use links.

Version 3

version: "3"
services:
  redis:
    image: redis
  db:
    image: postgres:9.4
  vote:
    image: voting-app
    ports:
      - 5000:80

This is the latest version as of writing. Version 3 comes with support for Docker Swarm. Some options were removed and some were added. To see details: https://docs.docker.com/compose/

Networking in Docker Compose

Let us say we modify the architecture a little bit to contain the traffic from the different sources in separate networks. For example, we would like to separate the user-generated traffic from the application’s internal traffic. So, we create a front-end network dedicated to the traffic from users and a backend network dedicated to the traffic within the application.

This is the .yaml file you need to write:

version: "3"
services:
  redis:
    image: redis
    networks:
      - back-end
  db:
    image: postgres:9.4
    networks:
      - back-end
  vote:
    image: voting-app
    ports:
      - "5000:80"
    networks:
      - front-end
      - back-end
  result:
    image: result-app
    ports:
      - "5001:80"
    networks:
      - front-end
      - back-end
  worker:
    image: worker
    networks:
      - front-end
      - back-end

networks:
  front-end:
    driver: bridge
  back-end:
    driver: bridge

So, that’s it! Please watch the video from FreeCodeCamp’s channel to get more detailed information. Also, take a look at KodeKloud’s channel!
